Contents

Preface
Contributors

I  The history of COPSS

1  A brief history of the Committee of Presidents of Statistical Societies (COPSS)
   Ingram Olkin
   1.1 Introduction
   1.2 COPSS activities in the early years
   1.3 COPSS activities in recent times
   1.4 Awards

II  Reminiscences and personal reflections on career paths

2  Reminiscences of the Columbia University Department of Mathematical Statistics in the late 1940s
   Ingram Olkin
   2.1 Introduction: Pre-Columbia
   2.2 Columbia days
   2.3 Courses

3  A career in statistics
   Herman Chernoff
   3.1 Education
   3.2 Postdoc at University of Chicago
   3.3 University of Illinois and Stanford
   3.4 MIT and Harvard

4  ". . . how wonderful the field of statistics is. . ."
   David R. Brillinger
   4.1 Introduction
   4.2 The speech (edited some)
   4.3 Conclusion

5  An unorthodox journey to statistics: Equity issues, remarks on multiplicity
   Juliet Popper Shaffer
   5.1 Pre-statistical career choices
   5.2 Becoming a statistician
   5.3 Introduction to and work in multiplicity
   5.4 General comments on multiplicity

6  Statistics before and after my COPSS Prize
   Peter J. Bickel
   6.1 Introduction
   6.2 The foundation of mathematical statistics
   6.3 My work before 1979
   6.4 My work after 1979
   6.5 Some observations

7  The accidental biostatistics professor
   Donna J. Brogan
   7.1 Public school and passion for mathematics
   7.2 College years and discovery of statistics
   7.3 Thwarted employment search after college
   7.4 Graduate school as a fallback option
   7.5 Master's degree in statistics at Purdue
   7.6 Thwarted employment search after Master's degree
   7.7 Graduate school again as a fallback option
   7.8 Dissertation research and family issues
   7.9 Job offers — finally!
   7.10 Four years at UNC-Chapel Hill
   7.11 Thirty-three years at Emory University
   7.12 Summing up and acknowledgements

8  Developing a passion for statistics
   Bruce G. Lindsay
   8.1 Introduction
   8.2 The first statistical seeds
   8.3 Graduate training
   8.4 The PhD
   8.5 Job and postdoc hunting
   8.6 The postdoc years
   8.7 Starting on the tenure track

9  Reflections on a statistical career and their implications
   R. Dennis Cook
   9.1 Early years
   9.2 Statistical diagnostics

15  We live in exciting times
    Peter G. Hall
    15.1 Introduction
    15.2 Living with change
    15.3 Living the revolution

16  The bright future of applied statistics
    Rafael A. Irizarry
    16.1 Introduction
    16.2 Becoming an applied statistician
    16.3 Genomics and the measurement revolution
    16.4 The bright future

17  The road travelled: From statistician to statistical scientist
    Nilanjan Chatterjee
    17.1 Introduction
    17.2 Kin-cohort study: My gateway to genetics
    17.3 Gene-environment interaction: Bridging genetics and theory of case-control studies
    17.4 Genome-wide association studies (GWAS): Introduction to big science
    17.5 The post-GWAS era: What does it all mean?
    17.6 Conclusion

18  A journey into statistical genetics and genomics
    Xihong Lin
    18.1 The 'omics era
    18.2 My move into statistical genetics and genomics
    18.3 A few lessons learned
    18.4 A few emerging areas in statistical genetics and genomics
    18.5 Training the next generation statistical genetic and genomic scientists in the 'omics era
    18.6 Concluding remarks

19  Reflections on women in statistics in Canada
    Mary E. Thompson
    19.1 A glimpse of the hidden past
    19.2 Early historical context
    19.3 A collection of firsts for women
    19.4 Awards
    19.5 Builders
    19.6 Statistical practice
    19.7 The current scene

20  "The whole women thing"
    Nancy M. Reid
    20.1 Introduction
    20.2 "How many women are there in your department?"
    20.3 "Should I ask for more money?"
    20.4 "I'm honored"
    20.5 "I loved that photo"
    20.6 Conclusion

21  Reflections on diversity
    Louise M. Ryan
    21.1 Introduction
    21.2 Initiatives for minority students
    21.3 Impact of the diversity programs
    21.4 Gender issues

IV  Reflections on the discipline

22  Why does statistics have two theories?
    Donald A.S. Fraser
    22.1 Introduction
    22.2 65 years and what's new
    22.3 Where do the probabilities come from?
    22.4 Inference for regular models: Frequency
    22.5 Inference for regular models: Bootstrap
    22.6 Inference for regular models: Bayes
    22.7 The frequency-Bayes contradiction
    22.8 Discussion

23  Conditioning is the issue
    James O. Berger
    23.1 Introduction
    23.2 Cox example and a pedagogical example
    23.3 Likelihood and stopping rule principles
    23.4 What it means to be a frequentist
    23.5 Conditional frequentist inference
    23.6 Final comments

24  Statistical inference from a Dempster–Shafer perspective
    Arthur P. Dempster
    24.1 Introduction
    24.2 Personal probability
    24.3 Personal probabilities of "don't know"
    24.4 The standard DS protocol
    24.5 Nonparametric inference
    24.6 Open areas for research

25  Nonparametric Bayes
    David B. Dunson
    25.1 Introduction
    25.2 A brief history of NP Bayes
    25.3 Gazing into the future

26  How do we choose our default methods?
    Andrew Gelman
    26.1 Statistics: The science of defaults
    26.2 Ways of knowing
    26.3 The pluralist's dilemma
    26.4 Conclusions

27  Serial correlation and Durbin–Watson bounds
    T.W. Anderson
    27.1 Introduction
    27.2 Circular serial correlation
    27.3 Periodic trends
    27.4 Uniformly most powerful tests
    27.5 Durbin–Watson

28  A non-asymptotic walk in probability and statistics
    Pascal Massart
    28.1 Introduction
    28.2 Model selection
    28.3 Welcome to Talagrand's wonderland
    28.4 Beyond Talagrand's inequality

29  The past's future is now: What will the present's future bring?
    Lynne Billard
    29.1 Introduction
    29.2 Symbolic data
    29.3 Illustrations
    29.4 Conclusion

30  Lessons in biostatistics
    Norman E. Breslow
    30.1 Introduction
    30.2 It's the science that counts
    30.3 Immortal time
    30.4 Multiplicity
    30.5 Conclusion

31  A vignette of discovery
    Nancy Flournoy
    31.1 Introduction
    31.2 CMV infection and clinical pneumonia
    31.3 Interventions
    31.4 Conclusions

32  Statistics and public health research
    Ross L. Prentice
    32.1 Introduction
    32.2 Public health research
    32.3 Biomarkers and nutritional epidemiology
    32.4 Preventive intervention development and testing
    32.5 Clinical trial data analysis methods
    32.6 Summary and conclusion

33  Statistics in a new era for finance and health care
    Tze Leung Lai
    33.1 Introduction
    33.2 Comparative effectiveness research clinical studies
    33.3 Innovative clinical trial designs in translational medicine
    33.4 Credit portfolios and dynamic empirical Bayes in finance
    33.5 Statistics in the new era of finance
    33.6 Conclusion

34  Meta-analyses: Heterogeneity can be a good thing
    Nan M. Laird
    34.1 Introduction
    34.2 Early years of random effects for meta-analysis
    34.3 Random effects and clinical trials
    34.4 Meta-analysis in genetic epidemiology
    34.5 Conclusions

35  Good health: Statistical challenges in personalizing disease prevention
    Alice S. Whittemore
    35.1 Introduction
    35.2 How do we personalize disease risks?
    35.3 How do we evaluate a personal risk model?
    35.4 How do we estimate model performance measures?
    35.5 Can we improve how we use epidemiological data for risk model assessment?
    35.6 Concluding remarks

36  Buried treasures
    Michael A. Newton
    36.1 Three short stories
    36.2 Concluding remarks

37  Survey sampling: Past controversies, current orthodoxy, and future paradigms
    Roderick J.A. Little
    37.1 Introduction
    37.2 Probability or purposive sampling?
    37.3 Design-based or model-based inference?
    37.4 A unified framework: Calibrated Bayes
    37.5 Conclusions

38  Environmental informatics: Uncertainty quantification in the environmental sciences
    Noel Cressie
    38.1 Introduction
    38.2 Hierarchical statistical modeling
    38.3 Decision-making in the presence of uncertainty
    38.4 Smoothing the data
    38.5 EI for spatio-temporal data
    38.6 The knowledge pyramid
    38.7 Conclusions

39  A journey with statistical genetics
    Elizabeth A. Thompson
    39.1 Introduction
    39.2 The 1970s: Likelihood inference and the EM algorithm
    39.3 The 1980s: Genetic maps and hidden Markov models
    39.4 The 1990s: MCMC and complex stochastic systems
    39.5 The 2000s: Association studies and gene expression
    39.6 The 2010s: From association to relatedness
    39.7 To the future

40  Targeted learning: From MLE to TMLE
    Mark van der Laan
    40.1 Introduction
    40.2 The statistical estimation problem
    40.3 The curse of dimensionality for the MLE
    40.4 Super learning
    40.5 Targeted learning
    40.6 Some special topics
    40.7 Concluding remarks

41  Statistical model building, machine learning, and the ah-ha moment
    Grace Wahba
    41.1 Introduction: Manny Parzen and RKHS
    41.2 Regularization methods, RKHS and sparse models
    41.3 Remarks on the nature-nurture debate, personalized medicine and scientific literacy
    41.4 Conclusion

42  In praise of sparsity and convexity
    Robert J. Tibshirani
    42.1 Introduction
    42.2 Sparsity, convexity and l1 penalties
    42.3 An example
    42.4 The covariance test
    42.5 Conclusion

43  Features of Big Data and sparsest solution in high confidence set
    Jianqing Fan
    43.1 Introduction
    43.2 Heterogeneity
    43.3 Computation
    43.4 Spurious correlation
    43.5 Incidental endogeneity
    43.6 Noise accumulation
    43.7 Sparsest solution in high confidence set
    43.8 Conclusion

44  Rise of the machines
    Larry A. Wasserman
    44.1 Introduction
    44.2 The conference culture
    44.3 Neglected research areas
    44.4 Case studies
    44.5 Computational thinking
    44.6 The evolving meaning of data
    44.7 Education and hiring
    44.8 If you can't beat them, join them

45  A trio of inference problems that could win you a Nobel Prize in statistics (if you help fund it)
    Xiao-Li Meng
    45.1 Nobel Prize? Why not COPSS?
    45.2 Multi-resolution inference
    45.3 Multi-phase inference
    45.4 Multi-source inference
    45.5 The ultimate prize or price

V  Advice for the next generation

46  Inspiration, aspiration, ambition
    C.F. Jeff Wu
    46.1 Searching the source of motivation
    46.2 Examples of inspiration, aspiration, and ambition
    46.3 Looking to the future

47  Personal reflections on the COPSS Presidents' Award
    Raymond J. Carroll
    47.1 The facts of the award
    47.2 Persistence
    47.3 Luck: Have a wonderful Associate Editor
    47.4 Find brilliant colleagues
    47.5 Serendipity with data
    47.6 Get fascinated: Heteroscedasticity
    47.7 Find smart subject-matter collaborators
    47.8 After the Presidents' Award

48  Publishing without perishing and other career advice
    Marie Davidian
    48.1 Introduction
    48.2 Achieving balance, and how you never know
    48.3 Write it, and write it again
    48.4 Parting thoughts

49  Converting rejections into positive stimuli
    Donald B. Rubin
    49.1 My first attempt
    49.2 I'm learning
    49.3 My first JASA submission
    49.4 Get it published!
    49.5 Find reviewers who understand
    49.6 Sometimes it's easy, even with errors
    49.7 It sometimes pays to withdraw the paper!
    49.8 Conclusion

50  The importance of mentors
    Donald B. Rubin
    50.1 My early years
    50.2 The years at Princeton University
    50.3 Harvard University — the early years
    50.4 My years in statistics as a PhD student
    50.5 The decade at ETS
    50.6 Interim time in DC at EPA, at the University of Wisconsin, and the University of Chicago
    50.7 The three decades at Harvard
    50.8 Conclusions

51  Never ask for or give advice, make mistakes, accept mediocrity, enthuse
    Terry Speed
    51.1 Never ask for or give advice
    51.2 Make mistakes
    51.3 Accept mediocrity
    51.4 Enthuse

52  Thirteen rules
    Bradley Efron
    52.1 Introduction
    52.2 Thirteen rules for giving a really bad talk


Preface

Statistics is the science of data collection, analysis, and interpretation. It plays a pivotal role in many disciplines, including environmental, health, economic, social, physical, and information sciences. Statistics not only helps advance scientific discovery in many fields but also influences the development of humanity and society. In an increasingly data-driven and information-rich world, statistics is ever more critical in formulating scientific problems in quantitative terms, accounting for and communicating uncertainty, analyzing and learning from data, transforming numbers into knowledge and policy, and navigating the challenges of making data-driven decisions. The emergence of data science is also presenting statisticians with extraordinary opportunities for increasing the impact of the field in the real world.

This volume was commissioned in 2013 by the Committee of Presidents of Statistical Societies (COPSS) to celebrate its 50th anniversary and the International Year of Statistics. COPSS consists of five charter member societies: the American Statistical Association (ASA), the Institute of Mathematical Statistics (IMS), the Statistical Society of Canada (SSC), and the Eastern and Western North American Regions of the International Biometric Society (ENAR and WNAR). COPSS is best known for sponsoring prestigious awards given each year at the Joint Statistical Meetings, the largest annual gathering of statisticians in North America. Through the contributions of a distinguished group of statisticians, this volume aims to showcase the breadth and vibrancy of statistics, to describe current challenges and new opportunities, to highlight the exciting future of statistical science, and to provide guidance for future generations of statisticians.

The 50 contributors to this volume are all past winners of at least one of the awards sponsored by COPSS: the R.A. Fisher Lectureship, the Presidents' Award, the George W. Snedecor Award, the Elizabeth L. Scott Award, and the F.N. David Award. Established in 1964, the Fisher Lectureship honors both the contributions of Sir Ronald A. Fisher and a present-day statistician for their advancement of statistical theory and applications. The COPSS Presidents' Award, like the Fields Medal in mathematics or the John Bates Clark Medal in economics, is an early career award. It was created in 1979 to honor a statistician for outstanding contributions to statistics. The G.W. Snedecor Award, founded in 1976 and bestowed biennially, recognizes instrumental theoretical work in biometry. The E.L. Scott Award and F.N. David Award are also given biennially, to commend efforts in promoting the role of women in statistics and to recognize female statisticians who are leading exemplary careers; these awards were set up in 1992 and 2001, respectively.

This volume is not only about statistics and science, but also about people and their passion for discovery. It contains expository articles by distinguished authors on a broad spectrum of topics of interest in statistical education, research, and applications. Many of these articles are accessible not only to professional statisticians and graduate students, but also to undergraduates interested in pursuing statistics as a career, and to all those who use statistics in solving real-world problems. Topics include reminiscences and personal reflections on statistical careers, perspectives on the field and profession, thoughts on the discipline and the future of statistical science, as well as advice for young statisticians. A consistent theme of all the articles is the passion for statistics enthusiastically shared by the authors. Their success stories inspire, give a sense of statistics as a discipline, and provide a taste of the exhilaration of discovery, success, and professional accomplishment.

This volume has five parts. In Part I, Ingram Olkin gives a brief overview of the 50-year history of COPSS. Part II consists of 11 articles by authors who reflect on their own careers (Ingram Olkin, Herman Chernoff, Peter Bickel), share the wisdom they gained (Dennis Cook, Kathryn Roeder) and the lessons they learned (David Brillinger), describe their journeys into statistics and biostatistics (Juliet Popper Shaffer, Donna Brogan), and trace their path to success (Bruce Lindsay, Jeff Rosenthal). Mary Gray also gives an account of her lifetime efforts to promote equity.

Part III comprises nine articles devoted to the impact of statistical science on society (Steve Fienberg), statistical education (Iain Johnstone), the role of statisticians in the interplay between statistics and science (Rafael Irizarry and Nilanjan Chatterjee), equity and diversity in statistics (Mary Thompson, Nancy Reid, and Louise Ryan), and the challenges of statistical science as we enter the era of big data (Peter Hall and Xihong Lin).

Part IV consists of 24 articles, in which authors provide insight on past developments, current challenges, and future opportunities in statistical science. A broad spectrum of issues is addressed, including the foundations and principles of statistical inference (Don Fraser, Jim Berger, Art Dempster), nonparametric statistics (David Dunson), model fitting (Andrew Gelman), time series analysis (Ted Anderson), non-asymptotic probability and statistics (Pascal Massart), symbolic data analysis (Lynne Billard), statistics in medicine and public health (Norman Breslow, Nancy Flournoy, Ross Prentice, Nan Laird, Alice Whittemore), environmental statistics (Noel Cressie), health care and finance (Tze Leung Lai), statistical genetics and genomics (Elizabeth Thompson, Michael Newton), survey sampling (Rod Little), targeted learning (Mark van der Laan), and statistical techniques for big data analysis, machine learning, and statistical learning (Jianqing Fan, Rob Tibshirani, Grace Wahba, Larry Wasserman). This part concludes with "a trio of inference problems that could win you a Nobel Prize in statistics" offered by Xiao-Li Meng.

Part V comprises seven articles, in which six senior statisticians share their experience and provide career advice. Jeff Wu talks about inspiration, aspiration, and ambition as sources of motivation; Ray Carroll and Marie Davidian give tips for success in research and publishing related to the choice of research topics and collaborators, familiarity with the publication process, and effective communication; and Terry Speed speaks of the necessity to follow one's own path and to be enthusiastic. Don Rubin proposed two possible topics: learning from failure and learning from mentors. As they seemed equally attractive, we asked him to do both. The book closes with Brad Efron's "thirteen rules for giving a really bad talk."

We are grateful to COPSS and its five charter member societies for supporting this book project. Our gratitude extends to Bhramar Mukherjee, former secretary and treasurer of COPSS, and to Jane Pendergast and Maura Stokes, current chair and secretary/treasurer of COPSS, for their efforts in support of this book. For their help in planning, we are also indebted to the members of COPSS' 50th anniversary celebration planning committee: Joel Greenhouse, John Kittelson, Christian Léger, Xihong Lin, Bob Rodriguez, and Jeff Wu.

Additional funding for this book was provided by the International Chinese Statistical Society, the International Indian Statistical Association, and the Korean International Statistical Society. We thank them for their sponsorship and further acknowledge the substantial in-kind support provided by the Institut des sciences mathématiques du Québec.

Last but not least, we would like to express our deep appreciation to Heidi Sestrich from Carnegie Mellon University for her technical assistance, dedication, and effort in compiling this volume. Thanks also to Taylor and Francis, and especially senior editor Rob Calver, for their help and support. With the publisher's authorization, this book's content is freely available at www.copss.org so that it can benefit as many people as possible.

We hope that this volume will inspire you and help you develop the same passion for statistics that we share with the authors. Happy reading!

The editors

Xihong Lin, Harvard University, Boston, MA
David L. Banks, Duke University, Durham, NC
David W. Scott, Rice University, Houston, TX
Christian Genest, McGill University, Montréal, QC
Geert Molenberghs, Universiteit Hasselt and KU Leuven, Belgium
Jane-Ling Wang, University of California, Davis, CA


Contributors

Theodore W. Anderson, Stanford University, Stanford, CA
James O. Berger, Duke University, Durham, NC
Peter J. Bickel, University of California, Berkeley, CA
Lynne Billard, University of Georgia, Athens, GA
Norman E. Breslow, University of Washington, Seattle, WA
David R. Brillinger, University of California, Berkeley, CA
Donna J. Brogan, Emory University, Atlanta, GA
Raymond J. Carroll, Texas A&M University, College Station, TX
Nilanjan Chatterjee, National Cancer Institute, Bethesda, MD
Herman Chernoff, Harvard University, Cambridge, MA
R. Dennis Cook, University of Minnesota, Minneapolis, MN
Noel Cressie, University of Wollongong, Wollongong, NSW, Australia
Marie Davidian, North Carolina State University, Raleigh, NC
Arthur P. Dempster, Harvard University, Cambridge, MA
David B. Dunson, Duke University, Durham, NC
Bradley Efron, Stanford University, Stanford, CA
Jianqing Fan, Princeton University, Princeton, NJ
Stephen E. Fienberg, Carnegie Mellon University, Pittsburgh, PA
Nancy Flournoy, University of Missouri, Columbia, MO
Donald A.S. Fraser, University of Toronto, Toronto, ON
Andrew Gelman, Columbia University, New York, NY
Mary W. Gray, American University, Washington, DC
Peter G. Hall, University of Melbourne, Australia, and University of California, Davis, CA
Rafael A. Irizarry, Dana-Farber Cancer Institute and Harvard School of Public Health, Boston, MA
Iain M. Johnstone, Stanford University, Stanford, CA
Tze Leung Lai, Stanford University, Stanford, CA
Nan M. Laird, Harvard School of Public Health, Boston, MA
Xihong Lin, Harvard School of Public Health, Boston, MA
Bruce G. Lindsay, Pennsylvania State University, University Park, PA
Roderick J. Little, University of Michigan, Ann Arbor, MI
Pascal Massart, Université de Paris-Sud, Orsay, France
Xiao-Li Meng, Harvard School of Public Health, Boston, MA
Michael A. Newton, University of Wisconsin, Madison, WI
Ingram Olkin, Stanford University, Stanford, CA
Ross Prentice, Fred Hutchinson Cancer Research Center, Seattle, WA
Nancy M. Reid, University of Toronto, Toronto, ON
Kathryn Roeder, Carnegie Mellon University, Pittsburgh, PA
Jeffrey S. Rosenthal, University of Toronto, Toronto, ON
Donald B. Rubin, Harvard University, Cambridge, MA
Louise M. Ryan, University of Technology Sydney, Sydney, Australia
Juliet Popper Shaffer, University of California, Berkeley, CA
Terry Speed, University of California, Berkeley, CA, and Walter and Eliza Hall Institute of Medical Research, Melbourne, Australia
Elizabeth A. Thompson, University of Washington, Seattle, WA
Mary E. Thompson, University of Waterloo, Waterloo, ON
Robert J. Tibshirani, Stanford University, Stanford, CA
Mark van der Laan, University of California, Berkeley, CA
Grace Wahba, University of Wisconsin, Madison, WI
Larry A. Wasserman, Carnegie Mellon University, Pittsburgh, PA
Alice S. Whittemore, Stanford University, Stanford, CA
C.F. Jeff Wu, Georgia Institute of Technology, Atlanta, GA


Part I

The history of COPSS


1

A brief history of the Committee of Presidents of Statistical Societies (COPSS)

Ingram Olkin
Department of Statistics, Stanford University, Stanford, CA

Shortly after becoming Chair of COPSS in 1992, I collated some of the organization's archival history. At that time there was already a 1972 document prepared by Walter T. Federer, who was Chair starting in 1965. The following is a composite of Federer's history coupled with my update in 1994, together with a review of recent activities.

1.1 Introduction

In 1958–59, the American Statistical Association (ASA), the Biometric Society (ENAR), and the Institute of Mathematical Statistics (IMS) initiated discussions to study relationships among statistical societies. Each of the three organizations often appointed a committee to perform similar or even identical duties, and communication among these and other groups was not always what was desired. Thus, in order to eliminate duplication of work, to improve communication among statistical societies, and to strengthen the scientific voice of statistics, the ASA, under the leadership of Rensis Likert, Morris H. Hansen, and Donald C. Riley, appointed a Committee to Study Relationships Among Statistical Societies (CONTRASTS). This committee was chaired by Frederick Mosteller. A series of campus discussions was initiated by members of CONTRASTS in order to obtain a broad base of opinion for a possible organization of statistical societies.

A grant of $9,000 was obtained from the Rockefeller Foundation by ASA to finance a series of discussions on the organizational needs of North American statisticians. Subsequently, an inter-society meeting to discuss relationships among statistical societies was held from Friday evening of September 16 through Sunday morning of September 18, 1960, at the Sterling Forest Onchiota Conference Center in Tuxedo, New York. An attempt was made at inclusiveness, and 23 cognate societies sent representatives in addition to Wallace O. Fenn of the American Institute of Biological Sciences, G. Baley Price of the Conference Board of the Mathematical Sciences, and May Robinson of the Brookings Institution. See The American Statistician, 1960, vol. 14, no. 4, pp. 2–3, for a complete listing of the representatives.

Mr. Fenn described the origin of the American Institute of Biological Sciences (AIBS) and pointed out its accomplishments after organization. Mr. Price discussed the background and importance of the Conference Board of the Mathematical Sciences. The main motion for future action that was passed was to the effect that a federation of societies concerned with statistics should be organized to consider some or all of the following items: (1) publicity; (2) publication; (3) bulletin; (4) newsletter; (5) subscription exchange; (6) abstracts; (7) translations; (8) directories; (9) national roster; (10) recruitment; (11) symposia; (12) visiting lecturer program and summer institutes; (13) joint studies; (14) films and TV; (15) Washington office; (16) nominations for national committees; (17) fellowships at national levels; (18) cooperative apprentice training program; (19) international cooperation. It is satisfying to note that many of these activities came to fruition. A committee was appointed to draft a proposal for a federation of statistical societies by January 1, 1961.

1.1.1 The birth of COPSS

It was not until December 9, 1961, that a meeting was held in New York to discuss the proposed federation and the roles of ASA, ENAR–WNAR, and IMS in such an organization. A more formal meeting of the presidents, secretaries, and other members of the ASA, IMS, and the Biometric Society (ENAR) was held at the Annual Statistics Meetings in December of 1961. At this meeting the Committee of Presidents of Statistical Societies (COPSS) was essentially born. It was agreed that the president, the secretary, and one society-designated officer of ASA and IMS, the president and secretary of ENAR, the president of WNAR, and one member-at-large would form the COPSS Committee (Amstat News, October 1962, p. 1). The Executive Committee of COPSS was to be composed of the presidents of ASA, ENAR, and IMS and the secretary of ASA, with each president to serve as Chairman of the Committee for four months of the year.

Philip Hauser, then President of the ASA, reported on the deliberations of COPSS at the Annual Meeting. Six joint committees were established with representatives from ASA, IMS, ENAR, and WNAR. The charges are described in his report as follows.

1. The Joint Committee on Professional Standards is charged with considering the problems relating to standards for professional statisticians and with recommending means to maintain such standards.

2. The Joint Committee on the Career Brochure will be charged generally with considerations relating to statistics as a career. A specific continuing task will be the preparation and revision of a Career Brochure.

3. The Joint Committee on Educational Opportunities will be charged with the preparation of a brochure designed to inform students and others interested in statistics as a career of the appropriate available training facilities, both graduate and undergraduate.

4. The Joint Committee for Liaison with the American Association for the Advancement of Science (AAAS) will represent the statistics profession in a new section within the AAAS.

5. The Joint Committee on Organizational Changes is charged with the study of constitutional changes among the societies and with recommending those which may facilitate more effective and efficient collaboration between them and between members of the profession generally.

6. The Joint Committee on News and Notes has the task of working out arrangements for The American Statistician to carry news and notes for the Institute of Mathematical Statistics and the Biometric Society as well as for ASA.

As the report continued, creation of these six joint committees was to be hailed as a major step forward toward more effective collaboration between statistical societies in pursuing common ends.

The joint publication of a directory of members was part of the initial thrust for avoiding duplication, and for a cooperative venture on the part of statistical societies. However, the needs of the member societies did not always coincide, and directories have been published jointly in 1973, 1978, 1987, and 1991, and separately by IMS in 1981. In 1996 it was generally agreed to plan for a joint directory to appear no less frequently than every three years. With the advent of computer technology, an up-to-date member directory became available at any time. However, the joint directory was concrete evidence of the benefits of collaboration between the statistical societies.

A meeting of the full COPSS Committee was held on Monday, September 10, 1962, to discuss a memorandum prepared by H.L. Lucas on (7) COPSS and its activities; (8) membership of COPSS Committees; (9) charges to COPSS Committees; and (10) Committee reports. The revised version of this report became the official document of COPSS. In addition to discussing the topics listed, COPSS established the following committees: (11) Standardization of Symbols and Notations; and (12) Memorial Session in Honor of Sir Ronald A. Fisher, who had died in Adelaide, Australia, on July 29.

The COPSS Committee on Organizational Changes and the COPSS Executive Committee held a meeting on Friday, March 1, 1963, to discuss a number of items dealing with cooperative arrangements among societies in COPSS, including certain possible structural changes within the societies. The items discussed were: (1) a national committee on statistics; (2) financial needs of COPSS; (3) liaison with related societies; (4) improvement of inter-society arrangements with respect to The American Statistician; (5) Mathematical Statistics Section of ASA; (6) implications of reorganization to IMS; (7) progress in coordination of ENAR and WNAR; (8) recommendations on Sections of ASA, other societies affiliated with ASA, and for improvement in structure and activities of COPSS; and (9) joint billings of ASA, ENAR, IMS, and WNAR.

Two meetings of COPSS, on August 26, 1963, and September 6, 1963, were held to consider a national committee on statistics, action on a report on availability of new statisticians, review of reports from COPSS Committees, a committee on liaison with related societies, the problem of recruiting census enumerators, and a proposal for publishing statistical tables. The 13th COPSS Committee, Liaison with Statistical Societies, was appointed during the summer of 1963.

At their meeting on Thursday, January 16, 1964, the COPSS Executive Committee considered the distribution of the new edition of Careers in Statistics, a national committee on statistics, recommendations of the Committee on Standardization of Symbols and Notation, suggestions regarding the Liaison Committee, availability of statisticians, and other items.

At the meeting of COPSS held on Tuesday, December 29, 1964, the member-at-large, Walter T. Federer, was elected as Chairman and Executive Secretary of COPSS for a three-year term, 1965–67. Federer was subsequently reappointed for a second three-year term, 1968–70, a one-year term in 1971, and a second one-year term in 1972. After this change the Executive Committee no longer met, so the ASA and IMS both designated the President-Elect as the other officer on COPSS, and some committees were added and others disbanded.

1.2 COPSS activities in the early years

It was decided that the minutes of COPSS meetings were of sufficient general interest to be published in The American Statistician. Meeting minutes are found in the following issues: February 1964, p. 2 (meeting of January 16, 1964); April 1969, p. 37 (meeting of August 21, 1968); June 1970, p. 2 (meeting of August 21, 1969); April 1972, p. 38 (meeting of December 27, 1970); April 1972, p. 40 (meeting of August 23, 1971). In addition, the minutes of the Onchiota Conference Center meeting appear on p. 2 of the October 1960 issue (see also p. 41).

Membership lists of COPSS Committees were published in Amstat News as follows: February 1963, p. 31; April 1965, p. 61; April 1966, p. 44; April 1967, p. 48; April 1968, p. 40; February 1969, p. 39; April 1970, p. 48; April 1971, p. 47; April 1972, p. 55.

Other Amstat News citations relating to work on COPSS Committees are the "Brochure on Statistics as a Career" (April 1962, p. 4), the "AAAS Elects Statisticians as Vice Presidents" (April 1962, p. 4), "Statistics Section (U) of the AAAS," by M.B. Ullman (February 1964, p. 9), and "Recommended Standards for Statistical Symbols and Notation," by M. Halperin, H.O. Hartley (Chairman), and P.G. Hoel (June 1965, p. 12).

The Academic Programs Committee was most beneficial, especially through the work of Franklin Graybill in several publications of the Committee on the Undergraduate Program in Mathematics (CUPM) relating to statistics, and of Paul Minton in preparing a list for Amstat News (October 1970, December 1971) of US and Canadian schools that offer degrees in statistics. The work of the Bernard Greenberg committee in preparing the "Careers in Statistics" brochure was particularly commendable, as evidenced by the fact that several hundred thousand copies of the brochure were distributed. The work of other committees also helped COPSS to achieve its goals. To further their efforts, it was suggested that committee members be placed on a rotating basis of three-year terms whenever possible, that certain members be considered ex-officio members of the committees, that the Committee for Liaison with Other Statistical Societies be studied to find more effective means of communication between societies, and that the Executive Committee of COPSS consider holding additional meetings to consider ways of strengthening statistics in the scientific community and nationally.

In 1965 the Conference Board of the Mathematical Sciences (CBMS) began conducting a survey of the state of undergraduate mathematical and statistical sciences in the nation. This survey is conducted every five years, and ten full reports have been issued, the latest being in 2010. With its statistical expertise, COPSS participated in the 1990 and 1995 surveys. COPSS was also represented at meetings of the Conference Board of the Mathematical Sciences, and this served to bring concerns of the statistical community to the attention of the cognate mathematics societies.

A chronic concern related to the annual meetings of the statistical societies. For many years the IMS alternated meetings with the ASA in one year and with the American Mathematical Society and the Mathematical Association of America in another year. Somehow schedules did not always mesh geographically, and this led to a considerable amount of negotiation. At one point, holding Joint Statistical Meetings (JSM) that included all the societies solved this problem.

In the early days of COPSS the position of Chair rotated among the societies. Later, a Chair and Secretary/Treasurer were chosen by the chairs of the member societies. These are listed in Table 1.1.


TABLE 1.1
List of COPSS Chairs and Secretary/Treasurers.

Period    Chair             Secretary/Treasurer
1994–96   Ingram Olkin      Lisa Weissfeld
1997–00   Marvin Zelen      Vickie Hertzberg
2001–03   Sallie Keller     Aparna Huzurbazar
2004–06   Linda Young       Karen Bandeen-Roche
2007–09   Jessica Utts      Madhuri Mulekar
2010–12   Xihong Lin        Bhramar Mukherjee
2013–15   Jane Pendergast   Maura Stokes

1.3 COPSS activities in recent times

Since the beginning of the 1990s, COPSS has been an established organization with a host of activities. The main program is described below, along with a number of notable projects.

The joint "Careers" brochure was revised in 1991 and again in 1993, and was mailed to over 100,000 individuals, schools, and institutions. It was subsequently revised by a committee consisting of Donald Bentley (Chair), Judith O'Fallon (ASA), Jeffrey Witner (IMS), Keith Soyer (ENAR), Kevin Cain (WNAR), and Cyntha Struthers, representing the Statistical Society of Canada (SSC). The latter society joined COPSS as a member in 1981.

A task force was formed to recommend a revision of the 1991 Mathematics Subject Classification in Mathematical Reviews (MR). A committee (David Aldous, Wayne Fuller, Robb Muirhead, Ingram Olkin, Emanuel Parzen, Bruce Trumbo) met at the University of Michigan with Executive Editor Ronald Babbit and also with editors of the Zentralblatt für Mathematik.

In 2011 the US Centers for Medicare and Medicaid Services (CMS) asked COPSS to prepare a white paper on "Statistical Issues on Assessing Hospital Performance." Xihong Lin, then Chair of COPSS, formed a committee consisting of Arlene Ash, Stephen Fienberg, Thomas Louis, Sharon-Lise Normand, Thérèse Stukel, and Jessica Utts to undertake this task. The report was to provide evaluation and guidance on statistical approaches used to estimate hospital performance metrics, with specific attention to statistical issues identified by the CMS and stakeholders (hospitals, consumers, and insurers). The issues related to modeling hospital quality based on outcomes using hierarchical generalized linear models. The committee prepared a very thoughtful and comprehensive report, which was submitted to CMS on November 30, 2011, and was posted on the CMS website.


1.3.1 The Visiting Lecturer Program in Statistics

The Visiting Lecturer Program (VLP) in Statistics was a major undertaking by COPSS. At the Annual Meeting of the IMS held at Stanford University in August 1960, discussion in the Council of the Institute, program presentations, and general comment forcibly re-emphasized the need to attract many more competent young people to professional careers in statistics. Governmental, educational, and industrial requirements for trained statisticians were not being met, and new positions were being created at a faster rate than the output of new statisticians could address. The difficulty was compounded by a paucity of information about careers in statistics, by the loss of instructional personnel to higher-paying, non-academic employment, and by competition from other sciences for students with mathematical skills. A proposal for a program of visiting scientists in statistics covering the years 1962–67 was drawn up and presented to the National Science Foundation. The program was funded in 1962 by means of a three-year NSF grant to set up the Visiting Lecturer Program in Statistics for 1963–64 and 1964–65 under the Chairmanship of Jack C. Kiefer. The VLP was administered by the IMS but became a COPSS Committee because of the nature of its activities.

The original rationale for the VLP was described as follows.

"Statistics is a very broad and exciting field of work. The main purpose of this program is to convey this excitement to students and others who may be interested. Specifically, we hope that this program will:

1. Provide information on the nature of modern statistics.

2. Illustrate the importance of statistics in all fields of scientific endeavor, particularly those involving experimental research, and to encourage instruction in statistics to students in all academic areas and at all levels.

3. Create an awareness of the opportunities for careers in statistics for students with high quantitative and problem-solving abilities and to encourage them to seek advanced training in statistics.

4. Provide information and advice to university and college faculties and students on the present availability of advanced training in statistics.

5. Encourage the development of new courses and programs in statistics."

Over the years the objectives changed somewhat, and by 1995 the Program had five similar main objectives: (1) to provide education and information on the nature and scope of modern statistics and to correct misconceptions held in regard to the science; (2) to establish and emphasize the role that statistics plays in research and practice in all fields of scientific endeavor, particularly those involving experimental research, and to encourage instruction in statistical theory and application to students in all academic areas; (3) to create an awareness of the opportunities for careers in statistics among young men and women of high potential mathematical ability and to encourage them to seek advanced training in statistics; (4) to provide information and advice to students, student counselors, and university and college faculties on the present availability of advanced training in statistics; (5) to encourage and stimulate new programs in statistics both to supplement programs in other curricula and to develop the further training of statisticians.

The 1963–64 VLP was highly successful. There were about 100 requests for lectures, and over 70 visits were made by 31 lecturers. Almost every school which was assigned a speaker received the first of its three choices. Lecturers were urged to give two talks, one at a technical level and one on statistics as a career. The program continued into the 1990s. A 1991–93 report by then-Chair Lynne Billard noted that in the two-year period there were over 40 visits. These were mainly to universities in which there was no statistics department. Examples are Allegheny College, Carson Newman College, Bucknell University, Furman University, Moorhead State University, Memphis State University, University of Puerto Rico, and Wabash University, to name but a few. In general the program was not designed for universities with an active statistics department. The chairs of the VLP are listed in Table 1.2.

TABLE 1.2
List of Chairs of the Visiting Lecturer Program (VLP).

Period    Chair
1962–63   Jack C. Kiefer
1964–65   W. Jackson Hall
1966–67   Shanti S. Gupta
1970–71   Don B. Owen
1974–75   Herbert T. David
1984–86   Jon Kettenring
1987–89   Fred Leysieffer
1990–94   Lynne Billard

For a list of the lecturers and the number of their visits, see The American Statistician: 1964, no. 4, p. 6; 1965, no. 4, p. 5; 1966, no. 4, p. 12; 1967, no. 4, p. 8; 1968, no. 4, p. 3; 1970, no. 5, p. 2; 1971, no. 4, p. 2.

1.4 Awards

COPSS initiated a number of awards which have become prestigious hallmarks of achievement. These are the Presidents' Award, the R.A. Fisher Lectureship, the George W. Snedecor Award, the Elizabeth L. Scott Award, and the Florence Nightingale David Award. The COPSS Awards Ceremony usually takes place on Wednesdays of the Joint Statistical Meetings. The Presidents' Award, the Snedecor or Scott Award, and the David Award are announced first, and are followed by the Fisher Lecture. All awards now include a plaque and a cash honorarium of $1,000.

The George W. Snedecor and Elizabeth L. Scott Awards receive support from an endowment fund. In the early days of these awards, financing was in a precarious state. This led to a solicitation for funds in August 1995 that raised $12,100. Subsequently a fiscal policy was established for the support of awards.

This history of COPSS provides us with an opportunity to thank the following donors for their assistance at that time: Abbott Laboratories ($500); Biopharmaceutical Research Consultants ($100); Bristol-Myers Squibb Pharmaceutical Research Institute ($500); Chapman & Hall ($1,000); COPSS ($4,250); Duxbury Press ($500); Institute of Mathematical Statistics ($500); Institute for Social Research Survey Research Center, University of Michigan ($500); Iowa State University ($750); Procter & Gamble ($500); Springer-Verlag ($500); Section on Statistical Graphics ($500); SYSTAT ($500); Trilogy Consulting Corporation ($500); John Wiley and Sons ($1,000).

1.4.1 Presidents' Award

COPSS sponsors the Presidents' Award and presents it to a young member of the statistical community in recognition of an outstanding contribution to the profession of statistics. The Presidents' Award was established in 1976 and is jointly sponsored by the American Statistical Association, the Institute of Mathematical Statistics, the Biometric Society ENAR, the Biometric Society WNAR, and the Statistical Society of Canada, operating through COPSS. (In 1994 the Biometric Society became the International Biometric Society.) The first award was given in 1979, and it is presented annually.

According to the award description, "The recipient of the Presidents' Award shall be a member of at least one of the participating societies. The Presidents' Award is granted to an individual who has not yet reached his or her 40th birthday at the time of the award's presentation. The candidate may be chosen for a single contribution of extraordinary merit, or an outstanding aggregate of contributions, to the profession of statistics."

The Presidents' Award Committee consists of seven members, including one representative appointed by each of the five member societies plus the COPSS Chair, and a past awardee as an additional member. The Chair of the Award Committee is appointed by the Chair of COPSS.

Prior to 1988 the COPSS Presidents' Award was funded by the ASA. When the 1987 award depleted this account, COPSS voted to reallocate a fraction of membership dues to fund future awards. Recipients are listed in Table 1.3 with their affiliation at the time of the award.


TABLE 1.3
List of COPSS Presidents' Award winners.

Year  Winner                 Affiliation (at the Time of Award)
1979  Peter J. Bickel        University of California, Berkeley
1981  Stephen E. Fienberg    Carnegie Mellon University
1983  Tze Leung Lai          Columbia University
1984  David V. Hinkley       University of Texas, Austin
1985  James O. Berger        Purdue University
1986  Ross L. Prentice       Fred Hutchinson Cancer Research Center
1987  Chien-Fu Jeff Wu       University of Wisconsin, Madison
1988  Raymond J. Carroll     Texas A&M University
1989  Peter Hall             Australian National University
1990  Peter McCullagh        University of Chicago
1991  Bernard W. Silverman   University of Bristol, UK
1992  Nancy M. Reid          University of Toronto, Canada
1993  Wing Hung Wong         University of Chicago
1994  David L. Donoho        Stanford University
1995  Iain M. Johnstone      Stanford University
1996  Robert J. Tibshirani   University of Toronto, Canada
1997  Kathryn Roeder         Carnegie Mellon University
1998  Pascal Massart         Université de Paris-Sud, France
1999  Larry A. Wasserman     Carnegie Mellon University
2000  Jianqing Fan           University of North Carolina, Chapel Hill
2001  Xiao-Li Meng           Harvard University
2002  Jun Liu                Harvard University
2003  Andrew Gelman          Columbia University
2004  Michael A. Newton      University of Wisconsin, Madison
2005  Mark J. van der Laan   University of California, Berkeley
2006  Xihong Lin             Harvard University
2007  Jeffrey S. Rosenthal   University of Toronto, Canada
2008  T. Tony Cai            University of Pennsylvania
2009  Rafael Irizarry        Johns Hopkins University
2010  David B. Dunson        Duke University
2011  Nilanjan Chatterjee    National Cancer Institute
2012  Samuel Kou             Harvard University
2013  Marc Suchard           University of California, Los Angeles


1.4.2 R.A. Fisher Lectureship

The R.A. Fisher Lectureship was established in 1963 to honor both the contributions of Sir Ronald Aylmer Fisher and a present-day statistician for their advancement of statistical theory and applications. The list of past Fisher Lectures well reflects the prestige that COPSS and its constituent societies place on this award. Awarded each year, the Fisher Lectureship represents meritorious achievement and scholarship in statistical science, and recognizes highly significant impacts of statistical methods on scientific investigations. The Lecturer is selected by the R.A. Fisher Lecture and Award Committee, which is chosen to reflect the interests of the member societies. The Lecture has become an integral part of the COPSS program, and is given at the Joint Statistical Meetings.

In the early days of COPSS, the award of the Lectureship was governed by the following conditions: (1) the Fisher Lectureship is awarded annually to an eminent statistician for outstanding contributions to the theory and application of statistics; (2) the Fisher Lecture shall be presented at a designated Annual Meeting of the COPSS societies; (3) the Lecture shall be broadly based and emphasize those aspects of statistics and probability which bear close relationship to the scientific collection and interpretation of data, areas in which Fisher himself made outstanding contributions; (4) the Lecture shall be scheduled so as to have no conflict with any other session at the Annual Meeting; (5) the Chair of the Lecture shall be the Chair of the R.A. Fisher Lecture and Award Committee or the Chair's designee: the Chair shall present a short statement on the life and works of R.A. Fisher, not to exceed five minutes in duration, and an appropriate introduction for the Fisher Lecturer; (6) the Lecturer is expected to prepare a manuscript based on the Lecture and to submit it to an appropriate statistical journal. There is an additional honorarium of $1,000 upon publication of the Fisher Lecture.

The recipients of the R.A. Fisher Lectureship are listed in Table 1.4, together with the titles of their lectures and their affiliations at the time of the award.

TABLE 1.4
Recipients of the R.A. Fisher Lectureship and titles of their lectures.

Year  Winner and Affiliation (Title of the Talk)
1964  Maurice S. Bartlett, University of Chicago and University College London, UK
      R.A. Fisher and the last fifty years of statistical methodology
1965  Oscar Kempthorne, Iowa State University
      Some aspects of experimental inference
1966  (none)
1967  John W. Tukey, Princeton University and Bell Labs
      Some perspectives in data analysis
1968  Leo A. Goodman, University of Chicago
      The analysis of cross-classified data: Independence, quasi-independence, and interactions in contingency tables with or without missing entries
1970  Leonard J. Savage, Princeton University
      On rereading R.A. Fisher
1971  Cuthbert Daniel, Private Consultant
      One-at-a-time plans
1972  William G. Cochran, Harvard University
      Experiments for nonlinear functions
1973  Jerome Cornfield, George Washington University
      On making sense of data
1974  George E.P. Box, University of Wisconsin, Madison
      Science and statistics
1975  Herman Chernoff, Massachusetts Institute of Technology
      Identifying an unknown member of a large population
1976  George A. Barnard, University of Waterloo, Canada
      Robustness and the logic of pivotal inference
1977  R.C. Bose, University of North Carolina
      R.A. Fisher's contribution to multivariate analysis and design of experiments
1978  William H. Kruskal, University of Chicago
      Statistics in society: Problems unsolved and unformulated
1979  C.R. Rao, The Pennsylvania State University
      Fisher efficiency and estimation of several parameters
1980  (none)
1981  (none)
1982  Frank J. Anscombe, Yale University
      How much to look at the data
1983  I. Richard Savage, University of Minnesota
      Nonparametric statistics and a microcosm
1984  (none)
1985  T.W. Anderson, Stanford University
      R.A. Fisher and multivariate analysis
1986  David H. Blackwell, University of California, Berkeley
      Likelihood and sufficiency
1987  Frederick Mosteller, Harvard University
      Methods for studying coincidences (with P. Diaconis)
1988  Erich L. Lehmann, University of California, Berkeley
      Model specification: Fisher's views and some later strategies
1989  Sir David R. Cox, Nuffield College, Oxford
      Probability models: Their role in statistical analysis
1990  Donald A.S. Fraser, York University, Canada
      Statistical inference: Likelihood to significance
1991  David R. Brillinger, University of California, Berkeley
      Nerve cell spike train data analysis: A progression of technique
1992  Paul Meier, Columbia University
      The scope of general estimation
1993  Herbert E. Robbins, Columbia University
      N and n: Sequential choice between two treatments
1994  Elizabeth A. Thompson, University of Washington
      Likelihood and linkage: From Fisher to the future
1995  Norman E. Breslow, University of Washington
      Statistics in epidemiology: The case-control study
1996  Bradley Efron, Stanford University
      R.A. Fisher in the 21st century
1997  Colin L. Mallows, AT&T Bell Laboratories
      The zeroth problem
1998  Arthur P. Dempster, Harvard University
      Logistic statistics: Modeling and inference
1999  John D. Kalbfleisch, University of Waterloo, Canada
      The estimating function bootstrap
2000  Ingram Olkin, Stanford University
      R.A. Fisher and the combining of evidence
2001  James O. Berger, Duke University
      Could Fisher, Jeffreys, and Neyman have agreed on testing?
2002  Raymond J. Carroll, Texas A&M University
      Variability is not always a nuisance parameter
2003  Adrian F.M. Smith, University of London, UK
      On rereading L.J. Savage rereading R.A. Fisher
2004  Donald B. Rubin, Harvard University
      Causal inference using potential outcomes: Design, modeling, decisions
2005  R. Dennis Cook, University of Minnesota
      Dimension reduction in regression
2006  Terence P. Speed, University of California, Berkeley
      Recombination and linkage
2007  Marvin Zelen, Harvard School of Public Health
      The early detection of disease: Statistical challenges
2008  Ross L. Prentice, Fred Hutchinson Cancer Research Center
      The population science research agenda: Multivariate failure time data analysis methods
2009  Noel Cressie, The Ohio State University
      Where, when, and then why
2010  Bruce G. Lindsay, Pennsylvania State University
      Likelihood: Efficiency and deficiency
2011  C.F. Jeff Wu, Georgia Institute of Technology
      Post-Fisherian experimentation: From physical to virtual
2012  Roderick J. Little, University of Michigan
      In praise of simplicity not mathematistry! Simple, powerful ideas for the statistical scientist
2013  Peter J. Bickel, University of California, Berkeley
      From Fisher to "Big Data": Continuities and discontinuities


1.4.3 George W. Snedecor Award

Established in 1976, this award honors George W. Snedecor, who was instrumental in the development of statistical theory in biometry. It recognizes a noteworthy publication in biometry appearing within three years of the date of the award. Since 1991 it has been given every other year, in odd-numbered years.

George W. Snedecor was born on October 20, 1881, in Memphis, TN, and was educated at the Alabama Polytechnic Institute, the University of Alabama, and the University of Michigan. He joined the faculty of Iowa State College (University) in 1913 and taught there for 45 years. In 1924, he and his colleague Henry Wallace (who became Secretary of Agriculture, 1933–40, 33rd Vice President of the United States, 1941–45, and Secretary of Commerce, 1945–46) organized a seminar to study regression and data analysis. He formed the Iowa State Statistics Laboratory in 1933 and served as its Director. His book Statistical Methods was published in 1937 and later, with William G. Cochran as co-author, went through seven editions. Iowa State's Department of Statistics separated from the Mathematics Department in 1939; it offered a Master's in statistics, the first of which was given to Gertrude Cox.

The F distribution, which is central to the analysis of variance, was obtained by Snedecor and called F after Fisher. Snedecor served as president of the American Statistical Association in 1948, was named an Honorary Fellow of the Royal Statistical Society in 1954, and received an honorary Doctorate of Science from North Carolina State University in 1956. Further details about Snedecor are contained in "Tales of Statisticians" and "Statisticians in History" (Amstat News, September 2009, pp. 10–11).

The recipients of the George W. Snedecor Award are listed in Table 1.5, along with references for the awarded publications.

TABLE 1.5
Recipients of the George W. Snedecor Award and publication(s).

1977  A. Philip Dawid
      Properties of diagnostic data distribution. Biometrics, 32:647–658.
1978  Bruce W. Turnbull and Toby J. Mitchell
      Exploratory analysis of disease prevalence data from survival/sacrifice experiments. Biometrics, 34:555–570.
1979  Ethel S. Gilbert
      The assessment of risks from occupational exposure to ionizing radiation. In Energy and Health, SIAM–SIMS Conference Series No. 6 (N. Breslow, Ed.), SIAM, Philadelphia, PA, pp. 209–225.
1981  Barry H. Margolin, Norman Kaplan, and Errol Zeiger
      Statistical analysis of the Ames salmonella/microsome test. Proceedings of the National Academy of Sciences, 78:3779–3783.
1982  Byron J.T. Morgan
      Modeling polyspermy. Biometrics, 38:885–898.
1983  Cavell Brownie and Douglas S. Robson
      Estimation of time-specific survival rates from tag-resighting samples: A generalization of the Jolly–Seber model. Biometrics, 39:437–453; and
1983  R.A. Maller, E.S. DeBoer, L.M. Joll, D.A. Anderson, and J.P. Hinde
      Determination of the maximum foregut volume of Western Rock Lobsters (Panulirus cygnus) from field data. Biometrics, 39:543–551.
1984  Stuart H. Hurlbert
      Pseudoreplication and the design of ecological field experiments. Ecological Monographs, 54:187–211; and
1984  John A. Anderson
      Regression and ordered categorical variables. Journal of the Royal Statistical Society, Series B, 46:1–30.
1985  Mitchell H. Gail and Richard Simon
      Testing for qualitative interactions between treatment effects and patient subsets. Biometrics, 41:361–372.
1986  Kung-Yee Liang and Scott L. Zeger
      Longitudinal data analysis using generalized linear models. Biometrika, 73:13–22; and
      Longitudinal data analysis for discrete and continuous outcomes. Biometrics, 42:121–130.
1987  George E. Bonney
      Regressive logistic models for familial disease and other binary traits. Biometrics, 42:611–625; and
      Logistic regression for dependent binary observations. Biometrics, 43:951–973.
1988  Karim F. Hirji, Cyrus R. Mehta, and Nitin R. Patel
      Exact inference for matched case-control studies. Biometrics, 44:803–814.
1989  Barry I. Graubard, Thomas R. Fears, and Mitchell H. Gail
      Effects of cluster sampling on epidemiologic analysis in population-based case-control studies. Biometrics, 45:1053–1071.
1990  Kenneth H. Pollack, James D. Nichols, Cavell Brownie, and James E. Hines
      Statistical inference for capture-recapture experiments. Wildlife Monographs, The Wildlife Society, 107.
1993  Kenneth L. Lange and Michael L. Boehnke
      Bayesian methods and optimal experimental design for gene mapping by radiation hybrid. Annals of Human Genetics, 56:119–144.
1995  Norman E. Breslow and David Clayton
      Approximate inference in generalized linear mixed models. Journal of the American Statistical Association, 88:9–25.
1997  Michael A. Newton
      Bootstrapping phylogenies: Large deviations and dispersion effects. Biometrika, 83:315–328; and
1997  Kathryn Roeder, Raymond J. Carroll, and Bruce G. Lindsay
      A semiparametric mixture approach to case-control studies with errors in covariables. Journal of the American Statistical Association, 91:722–732.
1999  Daniel Scharfstein, Anastasios Butch Tsiatis, and Jamie Robins
      Semiparametric efficiency and its implications on the design and analysis of group sequential studies. Journal of the American Statistical Association, 92:1342–1350.
2001  Patrick J. Heagerty
      Marginally specified logistic-normal models for longitudinal binary data. Biometrics, 55:688–698.
2003  Paul R. Rosenbaum
      Effects attributable to treatment: Inference in experiments and observational studies with a discrete pivot. Biometrika, 88:219–231; and
      Attributing effects to treatment in matched observational studies. Journal of the American Statistical Association, 97:183–192.
2005  Nicholas P. Jewell and Mark J. van der Laan
      Case-control current status data. Biometrika, 91:529–541.
2007  Donald B. Rubin
      The design versus the analysis of observational studies for causal effects: Parallels with the design of randomized trials. Statistics in Medicine, 26:20–36.
2009  Marie Davidian
      Improving efficiency of inferences in randomized clinical trials using auxiliary covariates. Biometrics, 64:707–715 (by M. Zhang, A.A. Tsiatis, and M. Davidian).
2011  Nilanjan Chatterjee
      Shrinkage estimators for robust and efficient inference in haplotype-based case-control studies. Journal of the American Statistical Association, 104:220–233 (by Y.H. Chen, N. Chatterjee, and R.J. Carroll).
2013  John D. Kalbfleisch
      Pointwise nonparametric maximum likelihood estimator of stochastically ordered survival functions. Biometrika, 99:327–343 (by Y. Park, J.M.G. Taylor, and J.D. Kalbfleisch).


1.4.4 Elizabeth L. Scott Award

In recognition of Elizabeth L. Scott's lifelong efforts to further the careers of women, this award is presented to an individual who has helped foster opportunities in statistics for women: by developing programs to encourage women to seek careers in statistics; by consistently and successfully mentoring women students or new researchers; by working to identify gender-based inequities in employment; or by serving in a variety of capacities as a role model. First awarded in 1992, it is given every other year, in even-numbered years.

Elizabeth Scott was born in Fort Sill, Oklahoma, on November 23, 1917. Her family moved to Berkeley, where she remained for the rest of her life. She was in the UC Berkeley astronomy program and published more than ten papers on comet positions. She received her PhD in 1949. Her dissertation was part astronomy and part statistics: "(a) Contribution to the problem of selective identifiability of spectroscopic binaries; (b) Note on consistent estimates of the linear structural relation between two variables." She collaborated with Jerzy Neyman on astronomical problems as well as weather modification.

In 1970 Elizabeth Scott co-chaired a university sub-committee which published a comprehensive study on the status of women in academia. Subsequently she led follow-up studies concerning gender-related issues such as salary discrepancies and tenure and promotion. She developed a toolkit for evaluating salaries that was distributed by the American Association of University Professors and used by many academic women to argue successfully for salary adjustments. She often told of her history in the Astronomy Department, which provided a telescope to every male faculty member, but not to her.

She received many honors and awards, and served as president of the IMS, 1977–78, and of the Bernoulli Society, 1983–85. She was Chair of the Statistics Department from 1968 to 1973. She was a role model for many of the women who are our current leaders. She died on December 20, 1988.

TABLE 1.6
Recipients of the Elizabeth L. Scott Award.

Year  Winner              Affiliation (at the Time of the Award)
1992  Florence N. David   University of California, Riverside
1994  Donna Brogan        University of North Carolina, Chapel Hill
1996  Grace Wahba         University of Wisconsin, Madison
1998  Ingram Olkin        Stanford University
2000  Nancy Flournoy      University of Missouri, Columbia
2002  Janet Norwood       Bureau of Labor Statistics
2004  Gladys Reynolds     Centers for Disease Control and Prevention
2006  Louise Ryan         Harvard University
2008  Lynne Billard       University of Georgia
2010  Mary E. Thompson    University of Waterloo, Canada
2012  Mary W. Gray        American University


For more details of her life and accomplishments, the web site "Biographies of Women Mathematicians" (http://www.agnesscott.edu/lriddle/women) recommends: (1) "Elizabeth Scott: Scholar, Teacher, Administrator," Statistical Science, 6:206–216; (2) "Obituary: Elizabeth Scott, 1917–1988," Journal of the Royal Statistical Society, Series A, 153:100; (3) "In memory of Elizabeth Scott," Newsletter of the Caucus for Women in Statistics, 19:5–6. The recipients of the Elizabeth L. Scott Award are listed in Table 1.6.

1.4.5 Florence Nightingale David Award

This award recognizes a female statistician who exemplifies the contributions of Florence Nightingale David, an accomplished researcher in combinatorial probability theory, author or editor of numerous books including a classic on the history of probability theory, Games, Gods, and Gambling, and first recipient of the Elizabeth L. Scott Award. Sponsored jointly by COPSS and the Caucus for Women in Statistics, the award was established in 2001 and consists of a plaque, a citation, and a cash honorarium. It is presented every other year, in odd-numbered years, if, in the opinion of the Award Committee, an eligible and worthy nominee is found. The Award Committee has the option of not giving an award for any given year.

F.N. David was born in the village of Irvington in Herefordshire, England, on August 23, 1909. She graduated from Bedford College for Women in 1931 with a mathematics degree. She sought advice from Karl Pearson about obtaining an actuarial position, but instead was offered a research position at University College, London. David collaborated with Pearson and Sam Stouffer (a sociological statistician) on her first paper, which appeared in 1932. Neyman was a visitor at this time and urged her to complete her PhD, which she did in 1938. During the war, she served as a statistician in military agencies. She remained at University College until 1967, when she joined the University of California at Riverside, serving as Chair of Biostatistics, which was later renamed the Department of Statistics. Her research output was varied and included both theory and applications. She published Probability Theory for Statistical Methods in 1949 and, jointly with D.E. Barton, Combinatorial Chance in 1962. David died in 1993 at the age of 83. The recipients of the F.N. David Award are listed in Table 1.7.

TABLE 1.7
Recipients of the Florence Nightingale David Award.

Year  Recipient               Affiliation
2001  Nan M. Laird            Harvard University
2003  Juliet Popper Shaffer   University of California, Berkeley
2005  Alice S. Whittemore     Stanford University
2007  Nancy Flournoy          University of Missouri, Columbia
2009  Nancy M. Reid           University of Toronto, Canada
2011  Marie Davidian          North Carolina State University
2013  Lynne Billard           University of Georgia


Part II

Reminiscences and personal reflections on career paths


2
Reminiscences of the Columbia University Department of Mathematical Statistics in the late 1940s

Ingram Olkin
Department of Statistics, Stanford University, Stanford, CA

2.1 Introduction: Pre-Columbia

Every once in a while in a dinner conversation, I have recalled my student days at Columbia, and have met with the suggestion that I write up these recollections. Although present-day students may recognize some of the famous names such as Hotelling, Wald, and Wolfowitz, they won't meet many faculty who were their students. The following is the result, and I hope the reader finds these reminiscences interesting. Because recollections of 60 years ago are often inaccurate, I urge readers to add to my recollections.

I started City College (CCNY) in 1941 and in 1943 enlisted in the US Army Air Force meteorology program. After completion of the program, I served as a meteorologist at various airports until I was discharged in 1946. I returned to CCNY and graduated in 1947, at which time I enrolled at Columbia University. As an aside, the professor at CCNY was Selby Robinson. Although not a great teacher, he somehow inspired a number of students to continue their study of statistics. Kenneth Arrow, Herman Chernoff, Milton Sobel, and Herbert Solomon are several who continued their studies at Columbia after graduating from CCNY.

Harold Hotelling was a key figure in my career. After receiving a doctorate at Princeton, Hotelling was at Stanford from 1924 to 1931, at the Food Research Institute and the Mathematics Department. In 1927 he taught three courses at Stanford: mathematical statistics (among the very early faculty to teach a rigorous course in statistics), differential geometry, and topology (who would tackle this today?). In 1931 he moved to Columbia, where he wrote his most famous papers in economics and in statistics (principal components, canonical correlations, $T^2$, to mention but a few). His 1941 paper on
the teaching of statistics had a phenomenal impact. Jerzy Neyman stated that it was one of the most influential papers in statistics. Faculty attempting to convince university administrators to form a Department of Statistics often used this paper as an argument why the teaching of statistics should be done by statisticians and not by faculty in substantive fields that use statistics. To read more about Hotelling, see Olkin and Sampson (2001a,b).

2.2 Columbia days

Hotelling had invited Abraham Wald to Columbia in 1938, and when Hotelling left in 1946 to be Head of the Statistics Department at Chapel Hill, Wald became Chair of the newly formed department at Columbia. The department was in the Faculty of Political Economy because the Mathematics Department objected to statistics being in the same division. The two economists F.E. Croxton and F.C. Mills taught statistics in the Economics Department and insisted that the new department be the Department of Mathematical Statistics to avoid any competition with their program. The other faculty were Ted Anderson and Jack Wolfowitz, later joined by Howard Levene and Henry Scheffé; Helen Walker was in the School of Education. (Helen was one of a few well-known, influential female statisticians. One source states that she was the first woman to teach statistics.) For a detailed history of the department, see Anderson (1955).

In the late 1940s Columbia, Chapel Hill, and Berkeley were statistical centers that attracted many visitors. There were other universities that had an impact in statistics, such as Princeton, Iowa State, Iowa, Chicago, Stanford, and Michigan, but conferences were mostly held at the top three. The first two Berkeley Symposia were in 1946 and 1950, and these brought many visitors from around the world.

The Second Berkeley Symposium brought a galaxy of foreign statisticians to the US: Paul Lévy, Bruno de Finetti, Michel Loève, Harald Cramér, Aryeh Dvoretzky, and Robert Fortet. Domestic faculty were present as well, such as Richard Feynman, Kenneth Arrow, Jacob Marshak, Harold Kuhn, and Albert Tucker. Because some of the participants came from distant lands, they often visited other universities as part of the trip. During the 1946–48 academic years the visitors were Neyman, P.L. Hsu, J.L. Doob, M.M. Loève, E.J.G. Pitman, and R.C. Bose, each teaching special-topic courses. Later Bose and Hsu joined Hotelling at Chapel Hill.

With the GI Bill, I did not have to worry about tuition, and enrolled at Columbia in two classes in the summer of 1947. The classes were crowded with post-war returnees. One class was a first course in mathematical statistics that was taught by Wolfowitz. Some of the students at Columbia during the 1947–50 period were Raj Bahadur and Thelma Clark (later his wife), Bob
Bechhofer, Allan Birnbaum, Al Bowker, Herman Chernoff (he was officially at Brown University, but worked with Wald), Herb T. David, Cyrus Derman, Sylvan Ehrenfeld, Harry Eisenpress, Peter Frank, Leon Herbach, Stanley Isaacson, Seymour Jablon, Jack Kiefer, Bill Kruskal, Gerry Lieberman, Gottfried Noether, Rosedith Sitgreaves, Milton Sobel, Herbert Solomon, Charles Stein, Henry Teicher, Lionel Weiss, and many others. Columbia Statistics was an exciting place, and almost all of the students continued their career in statistics. There was a feeling that we were in on the ground floor of a new field, and in many respects we were. From 1950 to 1970 The Annals of Mathematical Statistics grew from 625 to 2200 pages, with many articles from the students of this era.

Some statistics classes were held at night, starting at 5:40 and 7:30, so that students who worked during the day could get to class. However, math classes took place during the day. I took sequential analysis and analysis of variance from Wald, core probability from Wolfowitz, finite differences from B.O. Koopman, linear algebra from Howard Levi, differential equations from Ritt, a computer science course at the Columbia Watson Lab, and a course on analysis of variance from Helen Walker. Anderson taught multivariate analysis the year before I arrived. Govind Seth and Charles Stein took notes from this course, which later became Anderson's book on multivariate analysis.

Wald had a classic European lecture style. He started at the upper left corner of the blackboard and finished at the lower right. The lectures were smooth and the delivery was a uniform distribution. Though I had a lovely set of notes, Wald treated difficult and easy parts equally, so one did not recognize pitfalls when doing homework. The notion of an application in its current use did not exist. I don't recall the origin of the following quotation, but it is attributed to Wald: "Consider an application. Let $X_1, \ldots, X_n$ be i.i.d. random variables." In contrast to Wald's style, Wolfowitz's lectures were definitely not smooth, but he attempted to emphasize the essence of the topic. He struggled to try to explain what made the theorem "tick," a word he often used: "Let's see what makes this tick." However, as a novice in the field the gems of insight that he presented were not always appreciated. It was only years later as a researcher that they resurfaced, and were found to be illuminating.

Wolfowitz had a number of other pet phrases such as "It doesn't cut any ice," and "stripped of all baloney." It was a surprise to hear Columbia graduates years later using the same phrases. In a regression class with Wolfowitz we learned the Gauss–Seidel method. Wolfowitz was upset that the Doolittle method had a name attached to it, and he would exclaim, "Who is this Doolittle?" Many years later when Wolfowitz visited Stanford, a name might arise in a conversation. If Wolfowitz did not recognize the name he would say, "Jones, Jones, what theorem did he prove?"

In 1947–48 the only serious general textbooks were Cramér, Kendall, and Wilks' soft-covered notes. This was a time when drafts of books were being written. Feller's Volume I appeared in 1950, Doob's book on stochastic
processes in 1953, Lehmann's notes on estimation and testing of hypotheses in 1950, Scheffé's book on analysis of variance in 1953. The graduate students at Columbia formed an organization that duplicated lecture notes, especially those of visitors. Two that I remember are Doob's lectures on stochastic processes and Loève's on probability.

The Master's degree program required a thesis, and mine was written with Wolfowitz. The topic was on a sequential procedure that Leon Herbach (he was ahead of me) had worked on. Wolfowitz had very brief office hours, so there usually was a queue to see him. When I did see him in his office he asked me to explain my question at the blackboard. While talking at the blackboard Wolfowitz was multi-tasking (even in 1947) by reading his mail and talking on the telephone. I often think of this as an operatic trio in which each singer is on a different wavelength. This had the desired effect in that I never went back. However, I did manage to see him after class. He once said, "Walk me to the subway while we are talking," so I did. We did not finish our discussion by the time we reached the subway (only a few blocks away), so I went into the subway where we continued our conversation. This was not my subway line, so it cost me a nickel to talk to him. One of my students at Stanford 30 years later told me that I suggested that he walk with me while discussing a problem. There is a moral here for faculty.

Wald liked to take walks. Milton Sobel was one of Wald's students and he occasionally accompanied Wald on these walks. Later I learned that Milton took his students on walks. I wonder what is the 21st-century version of faculty-student interaction?

2.3 Courses

The Collyer brothers became famous for their compulsive collecting. I am not in their league, but I have saved my notes from some of the courses that I took. The following is an excerpt from the Columbia course catalog.

    Mathematical Statistics 111a — Probability. 3 points Winter Session.
    Professor Wolfowitz.
    Tu. Th. 5:40–6:30 and 7:30–8:20 p.m. 602 Hamilton.

    Fundamentals. Combinatorial problems. Distribution functions in one or more dimensions. The binomial, normal, and Poisson laws. Moments and characteristic functions. Stochastic convergence and the law of large numbers. Addition of chance variables and limit theorems.

    This course terminates on Nov. 18. A thorough knowledge of calculus is an essential prerequisite. Students are advised to study higher algebra simultaneously to obtain a knowledge of matrix algebra for use in more advanced mathematical statistics courses.


Milton Sobel was the teaching assistant for the 111a course; Robert Bechhofer and Allan Birnbaum were also TAs. I remember that Milton Sobel sat in on Wald's class on analysis of variance. Because he was at least one year ahead of me, I thought that he would have taken this course earlier. He said he did take it earlier, but the course was totally different depending on who was teaching it. It was depressing to think that I would have to take every course twice! As the course progressed Wald ran out of subscripts and superscripts on the right-hand side, e.g., $x_{ij}^{kl}$, and subsequently added some subscripts on the left-hand side.

Wolfowitz recommended three books, and assigned homework from them:

(a) H. Cramér (1945): Mathematical Methods of Statistics
(b) J.V. Uspensky (1937): Introduction to Mathematical Probability
(c) S.S. Wilks (1943): Mathematical Statistics

He mentioned references to Kolmogorov's Foundation of Probability and the Lévy and Roth book Elements of Probability.

Wolfowitz used the term "chance variables" and commented that the Law of Small Numbers should have been called the Law of Small Probabilities. As I look through the notes it is funny to see the old-fashioned factorial symbol ⌊n instead of n!. As I reread my notes it seems to me that this course was a rather simplified first course in probability. Some of the topics touched upon: the use of independence, Markov chains, joint distributions, conditional distributions, Chebychev's inequality, stochastic convergence (Slutsky's theorem), the Law of Large Numbers, convolutions, characteristic functions, and the Central Limit Theorem (with discussion of the Lyapunov and Lindeberg conditions).

I have a comment in which Wolfowitz notes an error in Cramér (p. 343): (a) if $y_1, y_2, \ldots$ is a sequence with $E(y_i) = c_i$ for all $i$ and $\sigma^2(y_i) \to 0$ as $i \to \infty$, then $\operatorname{plim}_{i\to\infty}(y_i - c_i) = 0$; (b) the converse is not true, in that it may be that $\sigma^2(y_i) \to \infty$ and yet $\operatorname{plim}(y_i - c_i) = 0$.

The second basic course was 111b, taught by Wald. The topics included point estimation, consistency, unbiasedness, asymptotic variance, maximum likelihood, likelihood ratio tests, and efficiency. This course was more mathematical than 111a in that there was more asymptotics. In terms of mathematical background, I note that he used Lagrange multipliers to show that, for $w_1, \ldots, w_n \in [0,1]$, the ratio $\sum_{i=1}^n w_i^2 / \bigl(\sum_{i=1}^n w_i\bigr)^2$ is minimized when $w_i = 1/n$ for all $i \in \{1, \ldots, n\}$. Apparently, convexity was not discussed.

There is a derivation of the chi-square distribution that includes a discussion of orthogonal matrices. This is one of the standard proofs. Other topics include the Schwarz inequality (but not used for the above minimization), and sufficiency. The second part of the course dealt with tests of hypotheses, with emphasis on the power function (Wald used the term "power curve"), acceptance sampling, and the OC curve.

My Columbia days are now over 65 years ago, but I still remember them as exciting and an incubator for many friendships and collaborations.


References

Anderson, T.W. Jr. (1955). The Department of Mathematical Statistics. In A History of the Faculty of Political Science, Columbia (R.G. Hoxie, Ed.). Columbia University Press, pp. 250–255.

Olkin, I. and Sampson, A.R. (2001a). Hotelling, Harold (1895–1973). In International Encyclopedia of the Social & Behavioral Sciences (N.J. Smelser and P.B. Baltes, Eds.). Pergamon, Oxford, pp. 6921–6925. URL http://www.sciencedirect.com/science/article/pii/B0080430767002631.

Olkin, I. and Sampson, A.R. (2001b). Harold Hotelling. In Statisticians of the Centuries. Springer, New York, pp. 454–458.


3
A career in statistics

Herman Chernoff
Department of Statistics, Harvard University, Cambridge, MA

3.1 Education

At the early age of 15, I graduated from Townsend Harris High School in New York and made the daring decision to study mathematics at the City College of New York (CCNY) during the Depression, rather than some practical subject like accounting. The Mathematics faculty of CCNY was of mixed quality, but the mathematics majors were exceptionally good. Years later, one of the graduate students in statistics at Stanford found a copy of the 1939 yearbook with a picture of the Math Club. He posted it with a sign "Know your Faculty."

At CCNY we had an excellent training in undergraduate mathematics, but since there was no graduate program, there was no opportunity to take courses in the advanced subjects of modern research. I was too immature to understand whether my innocent attempts to do original research were meaningful or not. This gave me an appetite for applied research, where successfully confronting a real problem that was not trivial had to be useful.

While at CCNY, I took a statistics course in the Mathematics Department, which did not seem very interesting or exciting, until Professor Selby Robinson distributed some papers for us to read during the week that he had to be away. My paper was by Neyman and Pearson (1933). It struck me as mathematically trivial and exceptionally deep, requiring a reorganization of my brain cells to confront statistical issues. At that time I had not heard of R.A. Fisher, who had succeeded in converting statistics to a theoretical subject in which mathematicians could work. Of course, he had little use for mathematicians in statistics, on the grounds that they confronted the wrong problems, and he was opposed to Neyman–Pearson theory (NP).

Once when asked how he could find the appropriate test statistic without recourse to NP, his reply was "I have no trouble." In short, NP made explicit the consideration of the alternative hypotheses necessary to construct good tests. This consideration was implicit for statisticians who understood their
problem, but often unclear to outsiders and students. Fisher viewed it as an unnecessary mathematization, but the philosophical issue was important. Years later Neyman gave a talk in which he pointed out that the NP Lemma was highly touted but trivial. On the other hand, it took him years of thinking to understand and state the issue.

Just before graduation I received a telegram offering me a position, which I accepted, as Junior Physicist at the Dahlgren Naval Proving Grounds in Virginia. After a year and a half I left Dahlgren to study applied mathematics at Brown University in Providence, Rhode Island. Dean Richardson had set up a program in applied mathematics where he could use many of the distinguished European mathematician émigrés to teach and get involved in research for the Defense Department, while many of the regular faculty were away working at Defense establishments. There was a good deal of coming and going of faculty, students and interesting visitors during this program.

During the following year I worked very hard as a Research Assistant for Professor Stefan Bergman and took many courses and audited a couple. I wrote a Master's thesis under Bergman on the growth of solutions of partial differential equations generated by his method, and received an ScM degree. One of the courses I audited was given by Professor Willy Feller, in which his lectures were a preliminary to the first volume of his two-volume outstanding books on probability.

During the following summer, I took a reading course in statistics from Professor Henry Mann, a number theorist who had become interested in statistics because some number theory issues were predominant in some of the work going on in experimental design. In fact, he had coauthored a paper with Abraham Wald (Mann and Wald, 1943) on how the o and O notation could be extended to $o_p$ and $O_p$. This paper also proved that if $X_n$ has as its limiting distribution that of $Y$ and $g$ is a continuous function, then $g(X_n)$ has as its limiting distribution that of $g(Y)$.

Mann gave me a paper by Wald (1939) on a generalization of inference which handled that of estimation and testing simultaneously. Once more I found this paper revolutionary. This was apparently Wald's first paper on decision theory. Although it did not resemble the later version of a game against nature, it clearly indicated the importance of cost considerations in statistical philosophy. Later discussions with Allen Wallis indicated that Wald had been aware of von Neumann's ideas about game theory. My theory is that in this first paper, he had not recognized the relevance, but as his work in this field grew, the formulation gradually changed to make the relationship with game theory clearer. Certainly the role of mixed strategies in both fields made the relation apparent.

At the end of the summer, I received two letters. One offered me a predoctoral NSF fellowship and the other an invitation I could not decline, to join the US Army. It was 1945, the war had ended, and the draft boards did not see much advantage in deferring young men engaged in research on the war effort. I was ordered to appear at Fort Devens, Massachusetts, where I was
given three-day basic training and assigned to work as a clerk in the separation center busy discharging veterans. My hours were from 5 PM to midnight, and I considered this my first vacation in many years. I could visit Brown on weekends, and on one of these visits Professor Prager suggested that I might prefer to do technical work for the army. He arranged for me to be transferred to Camp Lee, Virginia, where I was designated to get real basic training and end up, based on my previous experience, as a clerk in the quartermaster corps in Germany. I decided that I would prefer to return to my studies and had the nerve to apply for discharge on the grounds that I was a scientist, a profession in good repute at the end of the war. Much to everyone's surprise my application was approved and I returned to Brown, where much had changed during my brief absence. All the old European professors were gone, Prager had been put in charge of the new Division of Applied Mathematics, and a new group of applied mathematicians had replaced the émigrés.

I spent some months reading in probability and statistics. In particular Wald's papers on sequential analysis, a topic classified secret during the war, were of special interest.

During the summer of 1946 there was a six-week meeting in Raleigh, North Carolina, to open up the Research Triangle. Courses were offered by R.A. Fisher, J. Wolfowitz, and W. Cochran. Many prominent statisticians attended the meeting, and I had a chance to meet some of them and young students interested in statistics, and to attend the courses. Wolfowitz taught sequential analysis, Cochran taught sampling, and R.A. Fisher taught something.

Hotelling had moved to North Carolina because Columbia University had refused to start a Statistics Department. Columbia realized that they had made a mistake, and started a department with Wald as Chair and funds to attract visiting professors and faculty. Wolfowitz, who had gone to North Carolina, returned to Columbia. I returned to Brown to prepare for my preliminary exams. Since Brown no longer had any statisticians, I asked Wald to permit me to attend Columbia to write my dissertation in absentia under his direction. He insisted that I take some courses in statistics. In January 1947, I attended Columbia and took courses from T.W. Anderson, Wolfowitz, J.L. Doob, R.C. Bose and Wald.

My contact with Anderson led to a connection with the Cowles Commission for Research in Economics at the University of Chicago, where I was charged with investigating the possible use of computing machines for the extensive calculations that had to be done with their techniques for characterizing the economy. Those calculations were being done on electric calculating machines by human "computers" who had to spend hours carrying 10 digits inverting matrices of order as much as 12. Herman Rubin, who had received his PhD working under T. Koopmans at Cowles and was spending a postdoctoral year at the Institute for Advanced Study at Princeton, often came up to New York to help me wrestle with the clumsy IBM machines at the Watson
Laboratories of IBM, then located near Columbia. At the time, the engineers at Watson were working on developing a modern computer.

3.2 Postdoc at University of Chicago

I completed my dissertation on an approach of Wald to an asymptotic approximation to the (nonexistent) solution of the Behrens–Fisher problem, and in May 1948, I went to Chicago for a postdoc appointment at Cowles with my new bride, Judith Ullman. I was in charge of computing. Among my new colleagues were Kenneth Arrow, a former CCNY mathematics major, who was regarded as a brilliant economist, and Herman Rubin and several graduate students in economics, one of whom, Don Patinkin, went to Israel where he introduced the ideas of the Cowles Commission and later became President of the Hebrew University.

Arrow had not yet written a dissertation, but was invited to visit Rand Corporation that summer and returned with two outstanding accomplishments. One was the basis for his thesis and Nobel Prize, a proof that there was no sensible way a group could derive a preference ordering among alternatives from the preference orderings of the individual members of the group. The other was a proof of the optimum character of the sequential probability ratio test for deciding between two alternatives. The latter proof, with D. Blackwell and A. Girshick (Arrow et al., 1949), was derived after Wald presented a proof which had some measure-theoretic problems. The backward induction proof of ABG was the basis for the development of a large literature on dynamic programming. The basic idea of Wald and Wolfowitz, which was essential to the proof, had been to use a clever Bayesian argument.

I had always been interested in the philosophical issues in statistics, and Jimmie Savage claimed to have resolved one. Wald had proposed the minimax criterion for deciding how to select one among the many "admissible" strategies. Some students at Columbia had wondered why Wald was so tentative in proposing this criterion. The criterion made a good deal of sense in dealing with two-person zero-sum games, but the rationalization seemed weak for games against nature. In fact, a naive use of this criterion would suggest suicide if there was a possibility of a horrible death otherwise. Savage pointed out that in all the examples Wald used, his loss was not an absolute loss, but a regret for not doing the best possible under the actual state of nature. He proposed that minimax regret would resolve the problem. At first I bought his claim, but later discovered a simple example where minimax regret had a similar problem to that of minimax expected loss. For another example, the criterion led to selecting the strategy A, but if B was forbidden, it led to C and not A. This was one of the characteristics forbidden in Arrow's thesis.
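For concreteness, here is a small hypothetical loss table, constructed only to illustrate the last point (it is not the example in Chernoff, 1954). With two states of nature and three strategies, take the losses

\[
\begin{array}{c|cc}
  & \theta_1 & \theta_2 \\ \hline
A & 2 & 2 \\
B & 0 & 4 \\
C & 3 & 0
\end{array}
\]

With all three strategies available, the column minima are 0 and 0, so the maximum regrets are 2 for A, 4 for B, and 3 for C, and minimax regret selects A. If B is forbidden, the column minima become 2 and 0; the maximum regrets are then 2 for A and 1 for C, so minimax regret selects C rather than A.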


Savage tried to defend his method, but soon gave in with the remark that perhaps we should examine the work of de Finetti on the Bayesian approach to inference. He later became a sort of high priest in the ensuing controversy between the Bayesians and the misnamed frequentists. I posed a list of properties that an objective scientist should require of a criterion for decision theory problems. There was no criterion satisfying that list in a problem with a finite number of states of nature, unless we canceled one of the requirements. In that case the only criterion was one of all states being equally likely. To me that meant that there could be no objective way of doing science. I held back publishing those results for a few years hoping that time would resolve the issue (Chernoff, 1954).

In the controversy, I remained a frequentist. My main objection to Bayesian philosophy and practice was based on the choice of the prior probability. In principle, it should come from the initial belief. Does that come from birth? If we use instead a non-informative prior, the choice of one may carry hidden assumptions in complicated problems. Besides, the necessary calculation was very forbidding at that time. The fact that randomized strategies are not needed for Bayes procedures is disconcerting, considering the important role of random sampling. On the other hand, frequentist criteria lead to the contradiction of the reasonable criteria of rationality demanded by the derivation of Bayesian theory, and thus statisticians have to be very careful about the use of frequentist methods.

In recent years, my reasoning has been that one does not understand a problem unless it can be stated in terms of a Bayesian decision problem. If one does not understand the problem, the attempts to solve it are like shooting in the dark. If one understands the problem, it is not necessary to attack it using Bayesian analysis. My thoughts on inference have not grown much since then, in spite of my initial attraction to statistics that came from the philosophical impact of Neyman–Pearson and decision theory.

One slightly amusing correspondence with de Finetti came from a problem from the principal of a local school that had been teaching third graders Spanish. He brought me some data on a multiple choice exam given to the children to evaluate how successful the teaching had been. It was clear from the results that many of the children were guessing on some of the questions. A traditional way to compensate for guessing is to subtract a penalty for each wrong answer. But when the students are required to make a choice, this method simply applies a linear transformation to the score and does not provide any more information than the number of correct answers. I proposed a method (Chernoff, 1962) which turned out to be an early application of empirical Bayes. For each question, the proportion of correct answers in the class provides an estimate of how many guessed and what proportion of the correct answers were guesses. The appropriate reward for a correct answer should take this estimate into account. Students who hear of this approach are usually shocked because if they are smart, they will suffer if they are in a class with students who are not bright.
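The flavor of this class-level adjustment can be sketched in a few lines of code. The sketch below assumes a deliberately crude model in which every student either knows the answer or guesses uniformly among the available choices; it is only an illustration of the empirical Bayes idea, not the scoring rule actually derived in Chernoff (1962), and the function and variable names are invented for the example.

    def knowledge_credit(prop_correct, n_choices):
        """Credit for a correct answer to one question, judged from class data.

        Crude model (an assumption of this sketch, not Chernoff's): a fraction g
        of the class guessed uniformly among n_choices options and the rest knew
        the answer, so the observed proportion correct is p = (1 - g) + g/n_choices.
        The credit is the estimated probability that a correct answer reflects
        knowledge rather than a lucky guess.
        """
        m = n_choices
        p = max(prop_correct, 1.0 / m)          # guard against below-chance questions
        g = min(1.0, (1.0 - p) * m / (m - 1))   # estimated fraction of guessers
        return (1.0 - g) / p                    # P(knew the answer | answered correctly)

    def adjusted_score(correct_flags, class_props, n_choices=4):
        """Score one student by weighting each correct answer with its credit."""
        return sum(knowledge_credit(p, n_choices)
                   for ok, p in zip(correct_flags, class_props) if ok)

    # Example: the class did very well on question 1, so a correct answer there
    # earns nearly full credit; question 3 was near chance level, so a correct
    # answer there earns much less.
    print(adjusted_score([True, True, True], [0.95, 0.60, 0.30]))

A penalty for wrong answers, a treatment of omitted items, and shrinkage of the per-question estimates would all change the details; the point of the sketch is only to show how class-level information can enter an individual student's score.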


Bruno de Finetti heard of this method and he wrote to me suggesting that the student should be encouraged to state their probability for each of the possible choices. The appropriate score should be a simple function of the probability distribution and the correct answer. An appropriate function would encourage students to reply with their actual distribution rather than attempt to bluff. I responded that it would be difficult to get third graders to list probabilities. He answered that we should give the students five gold stars and let them distribute the stars among the possible answers.

3.3 University of Illinois and Stanford

In 1949, Arrow and Rubin went to Stanford, and I went to the Mathematics Department of the University of Illinois at Urbana. During my second year at Urbana, I received a call from Arrow suggesting that I visit the young Statistics Department at Stanford for the summer and the first quarter of 1951. That offer was attractive because I had spent the previous summer supplementing my $4,500 annual salary with a stint at the Operations Research Office of Johns Hopkins located at Fort Lesley J. McNair in Washington, DC. I had enjoyed the visit there, and learned about the Liapounoff theorem about the (convex) range of a vector measure, a powerful theorem that I had occasion to make use of and generalize slightly (Chernoff, 1951). I needed some summer salary. The opportunity to visit Stanford with my child and pregnant wife was attractive.

The head of the department was A. Bowker, a protégé of the provost F. Terman. Terman was a radio engineer, returned from working on radar in Cambridge, MA during the war, where he had learned about the advantages of having contracts with US Government agencies and had planned to exploit such opportunities. Essentially, he was the father of Silicon Valley. The Statistics Department had an applied contract with the Office of Naval Research (ONR) and I discovered, shortly after arriving, that as part of the contract, the personnel of the department supported by that contract were expected to engage in research with relevance to the ONR and to address problems posed to them on annual visits by scientists from the NSA. We distributed the problems posed in mathematical form without much background. I was given the problem of how best to decide between two alternative distributions of a random variable X when the test statistic must be a sum of integers Y with 1 ≤ Y ≤ k for some specified value of k and Y must be some unspecified function of X. It was clear that the problem involves partitioning the space of X into k subsets and applying the likelihood ratio. The Liapounoff theorem was relevant and the Central Limit Theorem gave error probabilities to use to select the best procedure.
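One way to formalize that reading of the problem, with notation introduced here only for illustration (it is not taken from the original report), is to quantize the likelihood ratio: choose cut points $0 = t_0 < t_1 < \cdots < t_k = \infty$ and set

\[
Y = j \quad \text{when } \frac{f_1(X)}{f_0(X)} \in [t_{j-1}, t_j), \qquad j = 1, \ldots, k,
\]

where $f_0$ and $f_1$ are the densities under the two alternative distributions. The test statistic is then the sum of the $Y$ values over independent observations, and the Central Limit Theorem gives approximate error probabilities for any choice of cut points, so that competing partitions can be compared.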


In working on an artificial example, I discovered that I was using the Central Limit Theorem for large deviations where it did not apply. This led me to derive the asymptotic upper and lower bounds that were needed for the tail probabilities. Rubin claimed he could get these bounds with much less work and I challenged him. He produced a rather simple argument, using the Markov inequality, for the upper bound. Since that seemed to be a minor lemma in the ensuing paper I published (Chernoff, 1952), I neglected to give him credit. I now consider it a serious error in judgment, especially because his result is stronger, for the upper bound, than the asymptotic result I had derived.

I should mention that Cramér (1938) had derived much more elegant and general results on large deviations. I discovered this after I derived my results. However, Cramér did require a condition that was not satisfied by the integer-valued random variables in my problem. Shannon had published a paper using the Central Limit Theorem as an approximation for large deviations and had been criticized for that. My paper permitted him to modify his results and led to a great deal of publicity in the computer science literature for the so-called Chernoff bound, which was really Rubin's result.

A second vaguely stated problem was misinterpreted by Girshick and myself. I interpreted it as follows: There exists a class of experiments, the data from which depend on two parameters, one of which is to be estimated. Independent observations with repetitions may be made on some of these experiments. The Fisher information matrix is additive and we wish to minimize the asymptotic variance, or equivalently the upper left corner of the inverse of the sum of the informations. We may as well minimize the same element of the inverse of the average of the informations. But this average lies in the convex set generated by the individual informations of the available experiments. Since each information matrix has three distinct elements, we have the problem of minimizing a function on a convex set in three dimensions. It is immediately obvious that we need at most four of the original available experiments to match the information for any design. By monotonicity it is also obvious that the optimum corresponds to a point on the boundary, and we need at most three of the experiments, and a more complicated argument shows that a mixture of at most two of the experiments will provide an asymptotically optimal experiment. This result (Chernoff, 1953) easily generalizes to the case of estimating a function of r of the k parameters involved in the available experiments.

The lively environment at Stanford persuaded me to accept a position there and I returned to settle in during the next academic year. Up to then I had regarded myself as a "theoretical statistical gun for hire" with no long-term special field to explore. But both of the problems described above have optimal design implications. I also felt that the nature of scientific study was to use experiments to learn about issues so that better experiments could be performed until a final decision was to be made. This led me to have sequential design of experiments as a major background goal.
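The Markov-inequality argument for the upper bound mentioned above can be stated in its now-standard textbook form (this is the usual presentation, not a transcription of the argument in Chernoff, 1952): for a sum $S_n = X_1 + \cdots + X_n$ of i.i.d. random variables whose moment generating function is finite,

\[
P(S_n \ge a) = P\bigl(e^{tS_n} \ge e^{ta}\bigr)
             \le e^{-ta}\, E\bigl[e^{tS_n}\bigr]
             = e^{-ta}\,\bigl(E\bigl[e^{tX_1}\bigr]\bigr)^{n}
\qquad \text{for every } t > 0,
\]

and minimizing the right-hand side over $t$ gives

\[
P(S_n \ge a) \le \exp\Bigl(-\sup_{t>0}\bigl\{\,ta - n \log E\bigl[e^{tX_1}\bigr]\bigr\}\Bigr).
\]

The first inequality is exactly Markov's inequality applied to the nonnegative random variable $e^{tS_n}$.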


At Stanford, I worked on many research projects which involved optimization and asymptotic results. Many seemed to come easily with the use of Taylor's theorem, the Central Limit Theorem and the Mann–Wald results. A more difficult case was the theorem of Chernoff and Savage (1958), where we established the Hodges–Lehmann conjecture about the efficiency of the nonparametric normal scores test. I knew very little about nonparametrics, but when Richard Savage and M. Dwass mentioned the conjecture, I thought that the variational argument would not be difficult, and it was easy. What surprised me was that the asymptotic normality, when the hypothesis of the equality of the two distributions is false, had not been established. Our argument approximating the relevant cumulative distribution function by a Gaussian process was tedious but successful. The result apparently opened up a side industry in nonparametric research, which was a surprise to Jimmie Savage, the older brother of Richard.

One side issue is the relevance of optimality and asymptotic results. In real problems the asymptotic result may be a poor approximation to what is needed. But, especially in complicated cases, it provides a guide for tabulating finite-sample results in a reasonable way with a minimum of relevant variables. Also, for technical reasons optimality methods are not always available, but what is optimal can reveal how much is lost by using practical methods and when one should search for substantially better ones, and often how to do so.

Around 1958, I proved that for the case of a finite number of states of nature and a finite number of experiments, an asymptotically optimal sequential design consists of solving a game where the payoff for the statistician using the experiment e against nature using $\theta$ is $I(\hat\theta, \theta, e)$ and $I$ is the Kullback–Leibler information, assuming the current estimate $\hat\theta$ is the true value of the unknown state (Chernoff, 1959). This result was generalized to infinitely many experiments and states by Bessler (1960) and Albert (1961), but Albert's result required that the states corresponding to different terminal decisions be separated.

This raised the simpler non-design problem of how to handle the test that the mean of a Normal distribution with known variance is positive or negative. Until then the closest approach to this had been to treat the case of three states of nature $a$, $0$, $-a$ for the means and to minimize the expected sample size for $0$ when the error probabilities for the other states were given. This appeared to me to be an incorrect statement of the relevant decision problem, which I asked G. Schwarz to attack. There the cost was a loss for the wrong decision and a cost per observation (no loss when the mean is 0). Although the techniques in my paper would work, Schwarz (1962) did a beautiful job using a Bayesian approach. But the problem where the mean could vary over the entire real line was still not done.

I devoted much of the next three years to dealing with the non-design problem of sequentially testing whether the mean of a Normal distribution with known variance is positive or negative. On the assumption that the payoff for each decision is a smooth function of the mean $\mu$, it seems natural to
measure the loss as the difference, which must be proportional to $|\mu|$ in the neighborhood of 0. To facilitate analysis, this problem was posed in terms of the drift of a Wiener process, and using Bayesian analysis, was reduced to a free boundary problem involving the heat equation. The two dimensions are $Y$, the current posterior estimate of the mean, and $t$, the precision of the estimate. Starting at $(t_0, Y_0)$, determined by the prior distribution, $Y$ moves like a standard Wiener process as sampling continues, and the optimal sequential procedure is to stop when $Y$ leaves the region determined by the symmetric boundary.

The research resulted in four papers; see Chernoff (1961), Breakwell and Chernoff (1964), Chernoff (1965a), and Chernoff (1965b). The first was preliminary, with some minor results and bounds and conjectures about the boundary near $t = 0$ and large $t$. Before I went off on sabbatical in London and Rome, J.V. Breakwell, an engineer at Lockheed, agreed to collaborate on an approach to large $t$ and I planned to concentrate on small $t$. In London I finally made a breakthrough and gave a presentation at Cambridge where I met J. Bather, a graduate student who had been working on the same problem. He had just developed a clever method for obtaining inner and outer bounds on the boundary.

Breakwell had used hypergeometric functions to get good asymptotic approximations for large $t$, but was unhappy because the calculations based on the discrete time problem seemed to indicate that his approximations were poor. His letter to that effect arrived just as I had derived the corrections relating the continuous time and discrete time problems, and these corrections indicated that the apparently poor approximations were in fact excellent.

Bather had impressed me so that I invited him to visit Stanford for a postdoc period. Let me digress briefly to mention that one of the most valuable functions of the department was to use the contracts to support excellent postdocs who could do research without teaching responsibilities and appreciate courses by Stein and Rubin that were often too difficult for many of our own students.

Breakwell introduced Bather and me to the midcourse correction problem for sending a rocket to the moon. The instruments measure the estimated miss distance continuously, and corrections early are cheap but depend on poor estimates, while corrections later involve good estimates but are expensive in the use of fuel. We found that our methods for the sequential problem work in this problem, yielding a region where nothing is done. But when the estimated miss distance leaves that region, fuel must be used to return. Shortly after we derived our results (Bather and Chernoff, 1967), a rocket was sent to the moon and about half way there, a correction was made and it went to the desired spot. The instrumentation was so excellent (and expensive) that our refined method was unnecessary. Bather declined to stay at Stanford as Assistant Professor and returned with his family to England to teach at Suffolk University. Later I summarized many of these results in a SIAM monograph on sequential analysis and optimal design (Chernoff, 1972).


A trip to a modern factory in Italy during my sabbatical gave me the impression that automation still had far to go, and that the study of pattern recognition and cluster analysis could be useful. There are many methods available for clustering, but it seemed that an appropriate method should depend on the nature of the data. This raised the problem of how to observe multidimensional data. It occurred to me that representing each n-dimensional data point by a cartoon of a face, where each of the components of the data point controlled a feature of the face, might be effective in some cases. A presentation of this idea with a couple of examples was received enthusiastically by the audience, many of whom went home and wrote their own version of what are popularly called "Chernoff faces." This took place at a time when the computer was just about ready to handle the technology, and I am reasonably sure that if I had not done it, someone else would soon have thought of the idea. Apparently I was lucky in having thought of using caricatures of faces, because faces are processed in the brain differently than other visual objects and caricatures have a larger impact than real faces; see Chernoff (1973).

3.4 MIT and Harvard

At the age of 50, I decided to leave Stanford and start a statistics program at MIT in the Applied Mathematics Section of the Mathematics Department. For several years we had a vital but small group, but the School of Science was not a healthy place for recognizing and promoting excellent applied statisticians, and so I retired from MIT to accept a position at Harvard University, from which I retired in 1997, but where I have an office that I visit regularly even though they don't pay me.

I am currently involved in a collaboration with Professor Shaw-Hwa Lo at Columbia University, who was inspired by a seminar course I offered at Harvard on statistical issues in molecular biology. We have been working on variable selection methods for large data sets with applications to biology and medicine; see Chernoff et al. (2009).

In review, I feel that I lacked some of the abilities that are important for an applied statistician who has to handle problems on a daily basis. I lacked the library of rough and ready techniques to produce usable results. However, I found that dealing with real applied problems, no matter how unimportant, without this library required serious consideration of the issues and was often a source of theoretical insight and innovation.
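To make the faces idea concrete, here is a minimal Python sketch; the mapping from data coordinates to facial features, the feature ranges and the toy data are all illustrative choices of mine, not Chernoff's original construction, which drove many more features of the cartoon.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Ellipse

def toy_face(ax, x):
    """Draw a crude face whose features are driven by a data vector x.

    x is assumed to be rescaled to [0, 1]; the assignment of components to
    features (head shape, eye size, mouth curvature, ...) is arbitrary.
    """
    head_w  = 0.6 + 0.4 * x[0]          # head width
    head_h  = 0.8 + 0.4 * x[1]          # head height
    eye_r   = 0.05 + 0.10 * x[2]        # eye size
    eye_sep = 0.20 + 0.20 * x[3]        # eye separation
    mouth_c = -0.3 + 0.6 * x[4]         # mouth curvature (frown .. smile)

    ax.add_patch(Ellipse((0, 0), head_w, head_h, fill=False))
    for s in (-1, 1):                   # the two eyes
        ax.add_patch(Ellipse((s * eye_sep / 2, 0.15), eye_r, eye_r, fill=False))
    t = np.linspace(-0.2, 0.2, 50)      # mouth drawn as a small parabola
    ax.plot(t, -0.25 + mouth_c * (t**2 - 0.04), color="k")
    ax.set_xlim(-0.7, 0.7); ax.set_ylim(-0.7, 0.7)
    ax.set_aspect("equal"); ax.axis("off")

rng = np.random.default_rng(0)
data = rng.uniform(size=(6, 5))         # six 5-dimensional "observations"
fig, axes = plt.subplots(1, 6, figsize=(12, 2))
for ax, row in zip(axes, data):
    toy_face(ax, row)
plt.show()
```

Each coordinate of an observation drives one feature, so observations that are close in all coordinates produce faces that look alike, which is what makes such caricature displays useful for informal clustering.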


References

Albert, A.E. (1961). The sequential design of experiments for infinitely many states of nature. The Annals of Mathematical Statistics, 32:774–799.

Arrow, K.J., Blackwell, D., and Girshick, M.A. (1949). Bayes and minimax solutions of sequential design problems. Econometrica, 17:213–244.

Bather, J.A. and Chernoff, H. (1967). Sequential decisions in the control of a spaceship. Proceedings of the Fifth Berkeley Symposium, University of California Press, 3:181–207.

Bessler, S. (1960). Theory and Application of the Sequential Design of Experiments, k-actions and Infinitely Many Experiments: Part I — Theory. Technical Report 55, Department of Statistics, Stanford University, Stanford, CA.

Breakwell, J.V. and Chernoff, H. (1964). Sequential tests for the mean of a Normal distribution II (large t). The Annals of Mathematical Statistics, 35:162–173.

Chernoff, H. (1951). An extension of a result of Liapounoff on the range of a vector measure. Proceedings of the American Mathematical Society, 2:722–726.

Chernoff, H. (1952). A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. The Annals of Mathematical Statistics, 23:493–507.

Chernoff, H. (1953). Locally optimal designs for estimating parameters. The Annals of Mathematical Statistics, 24:586–602.

Chernoff, H. (1954). Rational selection of decision functions. Econometrica, 22:422–443.

Chernoff, H. (1959). Sequential design of experiments. The Annals of Mathematical Statistics, 30:755–770.

Chernoff, H. (1961). Sequential tests for the mean of a Normal distribution. Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, University of California Press, 1:79–95.

Chernoff, H. (1962). The scoring of multiple choice questionnaires. The Annals of Mathematical Statistics, 35:375–393.

Chernoff, H. (1965a). Sequential tests for the mean of a Normal distribution III (small t). The Annals of Mathematical Statistics, 36:28–54.

Chernoff, H. (1965b). Sequential tests for the mean of a Normal distribution IV (discrete case). The Annals of Mathematical Statistics, 36:55–68.

Chernoff, H. (1972). Sequential Analysis and Optimal Design. Eighth Regional Conference Series in Applied Mathematics. Society for Industrial and Applied Mathematics, Philadelphia, PA.

Chernoff, H. (1973). The use of faces to represent points in k-dimensional space graphically. Journal of the American Statistical Association, 68:361–368.

Chernoff, H., Lo, S.H., and Zheng, T. (2009). Discovering influential variables: A method of partitions. The Annals of Applied Statistics, 3:1335–1369.

Chernoff, H. and Savage, I.R. (1958). Asymptotic normality and efficiency of certain non-parametric test statistics. The Annals of Mathematical Statistics, 29:972–994.

Cramér, H. (1938). Sur un nouveau théorème-limite de la théorie des probabilités. Actualités Scientifiques et Industrielles, 736, Paris, France.

Mann, H.B. and Wald, A. (1943). On stochastic limit and order relationships. The Annals of Mathematical Statistics, 14:217–226.

Neyman, J. and Pearson, E.S. (1933). On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society, 231:289–337.

Schwarz, G. (1962). Asymptotic shapes of Bayes sequential testing regions. The Annals of Mathematical Statistics, 33:224–236.

Wald, A. (1939). Contributions to the theory of statistical estimation and testing of hypotheses. The Annals of Mathematical Statistics, 10:299–326.


4 “. . . how wonderful the field of statistics is. . . ”

David R. Brillinger
Department of Statistics
University of California, Berkeley, CA

4.1 Introduction

There are two purposes for this chapter. The first is to remind/introduce readers to some of the important statistical contributions and attitudes of the great American scientist John W. Tukey. The second is to take note of the fact that statistics commencement speeches are important elements in the communication of statistical lore and advice, and not many seem to end up in the statistical literature. One that did was Leo Breiman's 1994 speech.¹ It was titled "What is the Statistics Department 25 years from now?" Another is Tukey's² presentation to his New Bedford high school. There has been at least one article on how to prepare such talks.³

Given the flexibility of this COPSS volume, in particular its encouragement of personal material, I provide a speech from last year. It is not claimed to be wonderful, rather to be one of a genre. The speech below was delivered June 16, 2012 for the Statistics Department Commencement at the University of California in Los Angeles (UCLA) upon the invitation of the Department Chair, Rick Schoenberg. The audience consisted of young people, their relatives and friends. They numbered perhaps 500. The event was outdoors on a beautiful sunny day.

The title and topic⁴ were chosen with the goal of setting before young statisticians and others interested the fact that America had produced a great scientist who was a statistician, John W. Tukey. Amongst other things he created the field of Exploratory Data Analysis (EDA). He gave the American statistical community prestige, and defined much of their work for years.

Further details on specific remarks are provided in a Notes section. The notes are indexed by superscripts at their locations. A brief bibliography is also provided.


4.2 The speech (edited some)

I thank Rick for inviting me, and I also thank the young lady who cheered when my name was announced. She has helped me get started. It is so very nice to see Rick and the other UCLA faculty that I have known through my academic years.

Part of my time I am a sports statistician, and today I take special note of the Kings⁵ winning the Stanley Cup⁶ five days ago. I congratulate you all, for surely your enthusiasm energized your team. I remark that for many years I have had a life-size poster of Wayne Gretzky,⁷ wearing a Kings uniform, in my Berkeley office;⁸ Rick would have seen it numerous times. All of you can enjoy this victory. I can tell you that I am still enjoying my Leafs⁹ victory although there has been a drought since.

Rick asked me to talk about "how wonderful the field of statistics is." No problem. I welcome the opportunity. I have forever loved my career as a statistical scientist, and in truth don't understand why every person doesn't wish to be a statistician,¹⁰ but there is that look. I mean the look one receives when someone asks what you do, and you say "statistics." As an example I mention that a previous University of California President once told me at a reception, seemingly proudly, that statistics had been the one course he had failed in his years at the University of California. Hmmh.

My talk this afternoon will provide a number of quotations associated with a great American scientist, Rick's statistical grandfather, John Wilder Tukey (1915–2000).

To begin, Rick, perhaps you know this already, but in case not, I mention that you owe John Tukey for your having an Erdős number of 4.¹¹

Mr. Tukey had a number of aliases including: John Tukey, Mr. Tukey, Dr. Tukey, Professor Tukey, JWT, and my favorite — The Tuke. The Tuke was born June 16, 1915 in New Bedford, Massachusetts, and in some ways he never left. He was a proud New Englander, he ate apple pie for breakfast, and he bought his Princeton house putting cash on the barrelhead.

Dr. Tukey was a unique individual during his childhood, as a professor, as an advisor, as an executive, and as a consultant. He learned to read at a very young age and was home schooled through high school. His higher education included Bachelor's and Master's degrees in Chemistry from Brown University in 1936 and 1937, followed by a Master's and a Doctorate in Mathematics from Princeton in 1938 and 1939.


He went on to be Higgins Professor at Princeton and Associate Executive Director of Research Information Sciences at Bell Telephone Laboratories. As a graduate student at Princeton he drew attention by serving milk instead of the usual beer at his doctoral graduation party.¹²

John Tukey was quick, like Richard Feynman.¹³ He could keep track of time while reciting poetry and seemingly do three different things simultaneously. I watched him continually, I guess, because I had never seen anyone quite like him before. He was called upon continually to provide advice to presidents and other decision makers. He created words and phrases like: bit, software, saphe cracking, the jackknife and his marvelous creation, EDA.¹⁴ He delighted in vague concepts, things that could be made specific in several ways, but were often better left vague. He worked in many fields including: astronomy, cryptography, psephology, information retrieval, engineering, computing, education, psychology, chemistry, pollution control, and economics.

John Tukey was firmly associated with Princeton and Bell Labs.¹⁵ Moreover, he had associations with UCLA. For example, I can mention his friendship and respect for Will Dixon. Will started your Biostat/Biomath group here in 1950 and had been John's colleague at the Fire Control Research Office (FCRO)¹⁶ in World War II.

John had the respect of scientists and executives. The Princeton physicist John Wheeler¹⁷ wrote:

"I believe that the whole country — scientifically, industrially, financially — is better off because of him and bears evidence of his influence. [...] John Tukey, like John von Neumann, was a bouncy and beefy extrovert, with interests and skills in physics as well as mathematics."

A former President of Bell Labs, W.O. Baker,¹⁸ said in response to a personal question:

"John was indeed active in the analysis of the Enigma¹⁹ system and then of course was part of our force in the fifties which did the really historic work on the Soviet codes as well. So he was very effective in that whole operation. [...] John has had an incisive role in each major frontier of telecommunications science and technology: uses of transistors and solid state; digital and computers."

Dr. Tukey was involved in the construction of the von Neumann computer. In particular, A. Burks wrote:

"John Tukey designed the electronic adding circuit we actually used in the Institute for Advanced Studies Computer. In this circuit, each binary adder fed its carry output directly into the next stage without delay."

John Tukey was renowned for pungent sayings.

"The best thing about being a statistician," he once told a colleague, "is that you get to play in everyone's backyard."

"The collective noun for a group of statisticians is a quarrel."

"Perhaps because I began in a hard but usually nondeductive science — chemistry — and was prepared to learn 'facts' rather than 'proofs', I have found it easier than most to escape the mathematician's implicit claim that the only real sciences are the deductive ones."

"Doing statistics is like doing crosswords except that one cannot know for sure whether one has found the solution."

"A consultant is a man who thinks with other people's brains."

"The stronger the qualitative understanding the data analyst can get of the subject matter field from which his data come, the better — just so long as he does not take it too seriously."

"Most statisticians are used to winning arguments with subject-matter colleagues because they know both statistics and the subject matter."

"The first task of the analyst of data is quantitative detective work."

"Well, what I think you need is folk dancing."²⁰

Tukey had a quick wit. For example, the seismologist Bruce Bolt and I developed a method to estimate certain Earth parameters following a great earthquake. I half-boasted to John that, with the next one, Bruce and I would be in the morning papers with estimates of the parameters and their uncertainties. John's response was, "What if it is in Berkeley?"

Indeed.


Tukey wrote many important books, and many papers. A selection of the latter may be found in his Collected Works.²¹

Some advice for the students

Learn the theory, for the theory becomes the practice.

Learn the math, because that is the hard part of the other sciences.

In consulting contexts ask, 'What is the question?' Ask it again, and again, and . . .

Answer a question with, 'It depends,' followed by saying what it depends upon.

Be lucky, remembering that you make your luck.

Don't forget that statisticians are the free-est of all scientists — they can work on anything. Take advantage.

Closing words

Congratulations, graduates.

May your careers be wonderful and may they emulate John Tukey's in important ways.

Thank you for your attention.

4.3 Conclusion

In my academic lifetime, statistical time series work went from the real-valued discrete time stationary case, to the vector-valued case, to the nonstationary case, to the point process case, to the spatial case, to the spatial-temporal case, to the generalized function case, to the function-valued time parameter case. It proved important that robust/resistant variants²² followed such cases.

In summary, there has been a steady progression of generalization and abstraction in modeling and data analysis of random processes. Learning the mathematics and continuing this progression is the challenge for the future.

For more details on John Tukey's life, see Brillinger (2002a) and Brillinger (2002b). This work was partially supported by the NSF Grant DMS–100707157.


Notes

1. www.stat.berkeley.edu/~dpurdy/Breiman-1994-commencement.html

2. See p. 306 in Anscombe (2003).

3. See Rodriguez (2012).

4. These words come from an email of Rick's describing his wishes for the talk.

5. The Kings are the National Hockey League (NHL) team in Los Angeles. They won the Stanley Cup in 2012.

6. The Stanley Cup is the trophy awarded to the NHL championship team each year.

7. Wayne Gretzky is a renowned Canadian hockey player holding many NHL records.

8. Room 417 in Evans Hall on the UCB campus.

9. The NHL team based in Toronto, Canada, where I grew up.

10. In Sacks and Ylvisaker (2012) one reads, "But seriously why would one choose to be something other than a statistician?"

11. A mathematician's Erdős number provides the "collaborative distance" from that person to Paul Erdős.

12. The famous mathematician John von Neumann is reputed to have said, "There is this very bright graduate student, and the remarkable thing is that he does it all on milk."

13. Richard Feynman was an American physicist known for his work in theoretical areas. With Julian Schwinger and Sin-Itiro Tomonaga, he received the Nobel Prize in Physics in 1965.

14. Exploratory data analysis (EDA): 1. It is an attitude; and 2. A flexibility; and 3. Some graph paper (or transparencies, or both). See Tukey (1965).

15. Bell Labs was an institution sponsored by AT&T. It was the birthplace of many scientific and development advances.

16. The Fire Control Research Office (FCRO) was located in Princeton during the Second World War.

17. John Archibald Wheeler (1911–2008) was an American theoretical physicist, and colleague of Tukey at Princeton. He worked in general relativity.

18. Personal communication from W.O. Baker.

19. Enigma was an important code employed by the Germans in World War II.

20. Personal communication from Leo Goodman.

21. See Cleveland (1984–1994). There are eight volumes spread over the years 1984–1994.

22. Robust refers to quantities not strongly affected by non-normality, and resistant refers to those not strongly affected by outliers.

References

Anscombe, F.R. (2003). The civic career and times of John W. Tukey. Statistical Science, 18:287–360.

Brillinger, D.R. (2002a). John W. Tukey: His life and professional contributions. The Annals of Statistics, 30:1535–1575.

Brillinger, D.R. (2002b). John W. Tukey's work on time series and spectrum analysis. The Annals of Statistics, 30:1595–1618.

Cleveland, W.S., Ed. (1984–1994). The Collected Works of John W. Tukey. Eight volumes, Wadsworth Publishing, Belmont, CA.

Rodriguez, R.W. (2012). Graduation time: Is your commencement speech ready? Amstat News, No. 419 (May), pp. 3–4.

Sacks, J. and Ylvisaker, D. (2012). After 50+ years in statistics, an exchange. Statistical Science, 27:308–318.

Tukey, J.W. (1965). We need both exploratory and confirmatory. The American Statistician, 34:23–25.


5 An unorthodox journey to statistics: Equity issues, remarks on multiplicity

Juliet Popper Shaffer
Department of Statistics
University of California, Berkeley, CA

The progression to my statistics career was anomalous, and not to be recommended to anyone interested initially in statistics. A fuller account of my earlier studies, as well as information on my childhood, can be found in an interview I gave to Robinson (2005) as former Editor of the Journal of Educational (now Educational and Behavioral) Statistics (JEBS). It is available for download on JSTOR and probably through many academic libraries.

In this paper I will recount briefly some pre-statistical career choices, describe a rather unorthodox way of becoming a statistician, introduce my major area, multiplicity, and note briefly some of my work in it, and make some general remarks about issues in multiplicity. I'll discuss the more technical issues without assuming any background in the subject. Except for a few recent papers, references will be only to some basic literature on the issues and not to the recent, often voluminous literature.

5.1 Pre-statistical career choices

At about 14 years of age I read a remarkably inspiring book, "Microbe Hunters," by Paul de Kruif (1926), and decided immediately to be a scientist. Since then, several other scientists have noted a similar experience with that book.

In addition to an interest in science, mathematics was always attractive to me. However, I thought of it wrongly as something very remote from the real world, like doing crossword puzzles, and wanted to be more engaged with that world.

My high school courses included a year each of beginning biology, chemistry, and physics. I wanted to take the four years of mathematics available, but that turned out to be a problem. At that time in my Brooklyn public high school, and probably at the other public high schools in New York City, boys were automatically enrolled in mathematics in the first semester of 9th grade, and girls in a language of their choice. That left me with only 3 1/2 semesters, and course availability made it impossible to take more than one mathematics class at a time.

This was just one example of the stereotyping, not to speak of outright discrimination, against women in those years, especially in mathematics and science. In fact, Brooklyn Technical High School, specializing in the science–technology–engineering–mathematics (STEM) area, was restricted to boys at that time. (It became co-ed in 1972.) My interview in JEBS discusses several other such experiences.

I solved the problem by taking intermediate algebra as an individual reading course, meeting once a week with a teacher and having problems assigned, along with a geometry course. In that way I managed to take all four years offered.

I started college (Swarthmore College) as a Chemistry major, but began to consider other possibilities after the first year. Introductory psychology was chosen in my second year to satisfy a distribution requirement. In the introductory lecture, the professor presented psychology as a rigorous science of behavior, both animal and human. Although some of my fellow students found the lecture boring, I was fascinated. After a brief consideration of switching to a pre-med curriculum, I changed my major to psychology.

In graduate school at Stanford University I received a doctorate in psychology. I enjoyed my psychological statistics course, so took an outside concentration in statistics with several courses in the mathematics and statistics departments.

There was then much discrimination against women in the academic job world. During the last year at Stanford, I subscribed to the American Psychological Association's monthly Employment Bulletin. Approximately half the advertised jobs said "Men only." Of course, that was before overt sex discrimination was outlawed in the Civil Rights Act of 1964. After an NSF Fellowship, used for a postdoctoral year working in mathematical psychology at Indiana University with one of the major contributors to that field (William Estes), I got a position in the Department of Psychology at the University of Kansas, thankfully one of the more enlightened departments.

5.2 Becoming a statistician

I taught part-time during several years while my three children were small. There were no special programs for this at the time, but fortunately my department allowed it. There was no sabbatical credit for part time, but finally, with enough full-time years to be eligible, I decided to use the sabbatical year (1973–74) to improve my statistics background. I chose the University of California (UC) Berkeley, because I had been using a book, "Testing Statistical Hypotheses," by E.L. Lehmann, as background for my statistics teaching in the Psychology Department.

As life sometimes evolves, during that year Erich Lehmann and I decided to marry. After a one-year return to Kansas, and a one-year visiting appointment at UC Davis in the Mathematics Department, I joined the Statistics Department at UC Berkeley as a lecturer, later becoming a senior lecturer. Although I had no degree in statistics, my extensive consulting in the Psychology Department at Kansas gave me greater applied statistical experience than most of the Statistics faculty at UC Berkeley, and I supervised a Berkeley Statistics Department consulting service for many years. We probably had about 2000 clients during that time, mostly graduate students, but some faculty, retired faculty, and even outside individuals of all kinds. One of the most interesting and amusing contacts was with a graduate student studying caterpillar behavior. The challenge for us was that when groups of caterpillars were being observed, it was not possible to identify the individuals, so counts of behaviors couldn't be allocated individually. He came to many of our meetings, and at the end invited all of us to a dinner he was giving in his large co-op, giving us a very fancy French menu. Can you guess what kind of a menu it was? Much to my regret, I didn't have the courage to go, and none of the consultants attended either.

During this early time in the department, I was also the editor of JEBS (see above) for four years, and taught two courses in the Graduate School of Education at Berkeley.

It's interesting to compare teaching statistics to psychologists and teaching it to statisticians. Of course the level of mathematical background was far greater among the statisticians. But the psychologists had one feature that statisticians, especially those going into applied work, would find valuable. Psychological research is difficult because the nature of the field makes it possible to have many alternative explanations, and methods often have defects that are not immediately obvious. As an example of the latter, I once read a study that purported to show that if shocks were given to a subject while a particular word was being read, the physiological reactions would generalize to other words with similar meanings. As a way of creating time between the original shock and the later tests on alternative words, both with similar and dissimilar meanings, subjects were told to sit back and think of other things. On thinking about this study, it occurred to me that subjects who had just been shocked on a particular word could well be thinking about other words with similar meanings in the interim, thus bringing thoughts of those words close to the time of the shock, and not separated from it as the experimenters assumed.

Thus psychologists, as part of their training, learn to think deeply about such alternative possibilities. One thing that psychologists know well is that individual differences are so important that it is essential to distinguish between multiple methods applied to the same groups of individuals and methods applied to different independent groups. I found that the statisticians were less sensitive to this than the psychologists. In the consulting classes, they sometimes forgot to even ask about that. Also, they sometimes seemed to think that with the addition of a few variables (e.g., gender, ethnicity, other distinguishing variables) they could take care of individual differences, and treat the data as conditionally independent observations, whereas psychologists would be quite wary of making that assumption. There must be other fields in which careful experimental thinking is necessary. This is not statistics in a narrow sense, but is certainly important for applied statisticians who may be involved in designing studies.

5.3 Introduction to and work in multiplicity

In addition to teaching many psychology courses during my time at Kansas, I also taught most of the statistics courses to the psychology undergraduate and graduate students. Analysis of variance (ANOVA) was perhaps the most widely-used procedure in experimental psychology. Consider, for example, a one-way treatment layout to be analyzed as an ANOVA. Given a significant F statistic, students would then compare every treatment with every other to see which were different, using methods with a fixed, conventional Type I error rate α (usually .05) for each. I realized that the probability of some false conclusions among these comparisons would be well above this nominal Type I error level, growing with the number of such comparisons. This piqued my interest in multiplicity problems, which eventually became my major area of research.

The criterion most widely considered at that time was the family-wise error rate (FWER), the probability of one or more false rejections (i.e., rejections of true hypotheses) in a set of tests. If tests are carried out individually with specified maximum (Type I) error rates, the probability of one or more errors increases with the number of tests. Thus, the error rate for the whole set should be considered. The statistical papers I read all referred to an unpublished manuscript, "The Problem of Multiple Comparisons," by John W. Tukey (1953). In those days, before Xerox, it was impossible to get copies of that manuscript. It was frustrating to have to use secondary sources. Fortunately, with the advent of Xerox, that problem has disappeared, and now, in addition, the manuscript is included in Tukey's Collected Works (Braun, 1994).

Tukey's treatment was extremely insightful and organized the field for some time to follow. In his introduction, he notes that he should have published it as a book at that time but "One reason this did not happen was the only piece of bad advice I ever had from Walter Shewhart! He told me it was unwise to put a book out until one was sure that it contained the last word of one's thinking."

My earlier work in multiple comparisons is described in the interview I gave to Robinson (2005). Since the time of Tukey's manuscript, a suggested alternative measure of error to be controlled is the False Discovery Rate (FDR), introduced by Benjamini and Hochberg (1995). This is the expected proportion of false rejections, defined as zero if there are no false rejections. Controlling FDR implies that if there are a reasonable number of true rejections, a small proportion of false ones is tolerable. John Tukey himself (personal remark) was enthusiastic about this new approach for some applications.

One of my early pieces of work that started a line of research by others arose from an interest in directional inference. When one tests a hypothesis of the form θ = θ₀, where θ₀ is a specific value, it is usually of interest, if the hypothesis is rejected, to decide either that θ > θ₀ or that θ < θ₀.


of sample treatment outcomes between the two means being compared. Our work shows that range-based methods lead to a higher proportion of separations than individual p-value methods. In connection with the FDR, it appears that an old test procedure, the Newman–Keuls, which fell out of favor because it did not control FWER, does control FDR. Extensive simulation results support this conclusion; proofs are incomplete. Interpretability is the main issue I'm working on at present.

5.4 General comments on multiplicity

Although championed by the very eminent John Tukey, multiplicity was a backwater of research and the issues were ignored by many researchers. This area has become much more prominent with the recent advent of "Big Data." Technological advances in recent years, as we know, have made massive amounts of data available bringing the desire to test thousands if not millions of hypotheses; application areas, for example, are genomics, proteomics, neurology, astronomy. It becomes impossible to ignore the multiplicity issues in these cases, and the field has enjoyed a remarkable development within the last 20 years. Much of the development has been applied in the context of big data. The FDR as a criterion is often especially relevant in this context. Many variants of the FDR criterion have been proposed, a number of them in combination with the use of empirical Bayes methods to estimate the proportion of true hypotheses. Resampling methods are also widely used to take dependencies into account. Another recent approach involves consideration of a balance between Type I and Type II errors, often in the context of simultaneous treatment of FDR and some type of false nondiscovery rate.

Yet the problems of multiplicity are just as pressing in small data situations, although often not recognized by practitioners in those areas. According to Young (2009), many epidemiologists feel they don't have to take multiplicity into account. Young and others claim that the great majority of apparent results in these fields are Type I errors; see Ioannidis (2005). Many of the newer approaches can't be applied satisfactorily in small data problems.

The examples cited above — one-way ANOVA designs and the large data problems noted — are what might be called well-structured testing problems. In general, there is a single set of hypotheses to be treated uniformly in testing, although there are variations. Most methodological research applies in this context. However, there have always been data problems of a very different kind, which might be referred to as ill-structured. These are cases in which there are hypotheses of different types, and often different importance, and it isn't clear how to structure them into families, each of which would be treated with a nominal error control measure.
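Since the FDR figures prominently in this discussion, a minimal sketch of the Benjamini–Hochberg step-up rule may help readers meeting it for the first time; the function and the toy data below are illustrative and are not taken from the chapter.

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Benjamini-Hochberg step-up rule: return a boolean rejection mask.

    Controls the false discovery rate at level q when the p-values of the
    true null hypotheses are independent (or positively dependent).
    """
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)                      # indices of sorted p-values
    thresholds = q * np.arange(1, m + 1) / m   # i*q/m for i = 1, ..., m
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()         # largest i with p_(i) <= i*q/m
        reject[order[:k + 1]] = True           # reject that and all smaller p-values
    return reject

# Toy illustration: 90 null p-values and 10 drawn near zero.
rng = np.random.default_rng(1)
pvals = np.concatenate([rng.uniform(size=90), rng.uniform(high=0.01, size=10)])
print(benjamini_hochberg(pvals).sum(), "rejections at FDR level 0.05")
```

A Bonferroni-style FWER rule would instead compare every p-value to q/m, which is far more stringent when the number of tests m is large; that difference is what makes FDR control attractive in the "Big Data" settings described above.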


A simple example is the division into primary and secondary outcomes in clinical research. If the primary outcomes are of major importance, how should that be taken into account? Should error control at a nominal level, for example the usual .05 level, be set separately for each set of outcomes? Should there be a single α level for the whole set, but with different weights on the two different types of outcomes? Should the analysis be treated as hierarchical, with secondary outcomes tested only if one (or more) of the primary outcomes shows significant effects?

A more complex example is the analysis of a multifactor study by ANOVA. The standard analysis considers the main effect of each factor and the interactions of all factors. Should the whole study be evaluated at the single nominal α level? That seems unwise. Should each main effect be evaluated at that level? How should interactions be treated? Some researchers feel that if there is an interaction, main effects shouldn't be further analyzed. But suppose one high-order interaction is significant at the nominal α level. Does that mean the main-effect tests of the factors involved aren't meaningful?

Beyond these analyses, if an effect is assumed to be significant, how should the ensuing more detailed analysis (e.g., pairwise comparisons of treatments) be handled, considering the multiplicity issues? There is little literature on this subject, which is clearly very difficult. Westfall and Young (1993, Chapter 7) give examples of such studies and the problems they raise.

Finally, one of the most complex situations is encountered in a large survey, where there are multiple factors of different types, multiple subgroups, and perhaps longitudinal comparisons. An example is the National Assessment of Educational Progress (NAEP), now carried out yearly, with many educational subjects, many subgroups (gender, race-ethnicity, geographical area, socio-economic status, etc.), and longitudinal comparisons in all these.

A crucial element in all such ill-structured problems, as noted, is the definition of families for which error control is desired. In my two years as director of the psychometric and statistical analysis of NAEP at Educational Testing Service, we had more meetings on this subject, trying to decide on family definitions and handling of interactions, than on any other. Two examples of difficult problems we faced:

(a) Long term trend analyses were carried out by using the same test at different time points. For example, nine-year-olds were tested in mathematics with an identical test given nine times from 1978 to 2004. At first it was planned to compare each time point with the previous one. In 1982, when the second test was given, there was only one comparison. In 2004, there were eight comparisons (time 2 with time 1, time 3 with time 2, etc.). Treating the whole set of comparisons at any one time as a family, the family size increased with the addition of each new testing time. Thus, to control the FWER, each pairwise test had to reach a stricter level of significance in subsequent analyses. But it would obviously be confusing to call a change significant at one time only to have it declared not significant at a later time point.


(b) Most states, as well as some large cities, take part in state surveys, giving educational information for those units separately. Suppose one wants to compare every unit with every other. With, say, 45 units, the number of comparisons in the family is 45 × 44/2, or 990. On the other hand, people in a single unit (e.g., state) are usually interested mainly in how it compares with other units; these comparisons result in a family size of 44. Results are likely to differ. How can one reconcile the different decisions, when results must be transmitted to a public that thinks that there should be a single decision for each comparison?

Extensive background material and results for NAEP are available at nces.ed.gov/nationsreportcard. In addition, there is a book describing the development of the survey (Jones and Olkin, 2004). For information specifically on the handling of multiplicity issues, see nces.ed.gov/nationsreportcard/tdw/analysis/2000_2001/infer_multiplecompare_fdr.asp.

In summary, the work on multiplicity has multiplied with the advent of big data, although the ill-structured situations described above have been around for a long time with little formal attention, and more guidance on handling of family size issues, with examples, would be a contribution that could result in wider use of multiple comparison methods.

References

Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B, 57:289–300.

Braun, H., Ed. (1994). The Collected Works of John W. Tukey. Vol. 8: Multiple Comparisons, 1948–1983. Chapman & Hall, London.

de Kruif, P. (1926). Microbe Hunters. Harcourt, Brace, Jovanovich, New York.

Finner, H. (1999). Stepwise multiple test procedures and control of directional errors. The Annals of Statistics, 27:274–289.

Guo, W., Sarkar, S.K., and Peddada, S.D. (2010). Controlling false discoveries in multidimensional directional decisions, with applications to gene expression data on ordered categories. Biometrics, 66:485–492.

Ioannidis, J.P.A. (2005). Why most published research findings are false. PLoS Medicine, 2(8):e124.

Jones, L.V. and Olkin, I., Eds. (2004). The Nation's Report Card: Evolution and Perspectives. Phi Delta Kappa Educational Foundation, Bloomington, IN.

Robinson, D. (2005). Profiles in research: Juliet Popper Shaffer. Journal of Educational and Behavioral Statistics, 30:93–103.

Shaffer, J.P., Kowalchuk, R.K., and Keselman, H.J. (2013). Error, power, and cluster-separation rates of pairwise multiple testing procedures. Psychological Methods, 18:352–367.

Tukey, J.W. (1953). The problem of multiple comparisons. In The Collected Works of John W. Tukey, Vol. 8: Multiple Comparisons, 1948–1983 (H. Braun, Ed.). Chapman & Hall, London.

Westfall, P.H. and Tobias, R.D. (2007). Multiple testing of general contrasts: Truncated closure and the extended Shaffer–Royen method. Journal of the American Statistical Association, 102:487–494.

Westfall, P.H. and Young, S.S. (1993). Resampling-Based Multiple Testing: Examples and Methods for p-Value Adjustment. Wiley, New York.

Young, S.S. (2009). Health findings and false discoveries. Paper presented in the symposium "False Discoveries and Statistics: Implications for Health and the Environment," at the 2009 Annual Meeting of the American Association for the Advancement of Science.


6 Statistics before and after my COPSS Prize

Peter J. Bickel
Department of Statistics
University of California, Berkeley, CA

This is largely an account of my research career, the development of different interests as I moved along, and the influences, people and ideas that determined them. It also gives an idiosyncratic view of the development of the field and ends with some words of advice.

6.1 Introduction

I was fortunate enough to be young enough to receive the first COPSS prize in 1979. It was a fairly rushed affair. I flew back from France where I was giving some lectures, mumbled that I felt like the robber in a cops and robbers drama since I didn't feel I had done enough to deserve the prize, and then returned to France the next day.

This is partly the story of my life before and after the prize and my contributions, such as they were, but, more significantly, it describes my views on the changes in the main trends in the field that occurred during the last 30+ years. In addition, given my age of 72, I can't resist giving advice.

6.2 The foundation of mathematical statistics

During the period 1940 to 1979 an impressive theoretical edifice had been built on the foundations laid by Fisher, Pearson and Neyman up to the 1940s and then built up by Wald, Wolfowitz, LeCam, Stein, Chernoff, Hodges and Lehmann, and Kiefer, among others, on the frequentist side and by L.J. Savage on the Bayesian side, with Herbert Robbins flitting in between. There were, of course, other important ideas coming out of the work of people such as I.J. Good, H.O. Hartley, J.W. Tukey and others, some of which will come up later. The focus of most of these writers was on classical inference applied to fairly standard situations, such as linear models, survey sampling and designed experiments as explored in the decision-theoretic framework of Wald, which gave a clear infrastructure for both frequentist and Bayesian analysis. This included the novel methodology and approach of sequential analysis, introduced by Wald in the 1950s, and the behavior of rank-based nonparametric tests and estimates based on them as developed by Hodges and Lehmann. Both of these developments were pushed by World War II work. Robustness considerations were brought to the fore through Tukey's influential 1962 paper "The Future of Data Analysis" and then the seminal work of Hampel and Huber.

6.3 My work before 1979

My thesis with Erich Lehmann at the University of California (UC) Berkeley was on a robust analogue of Hotelling's T² test and related estimates. I then embarked on a number of contributions to many of the topics mentioned, including more robustness theory, Bayesian sequential analysis, curve estimation, asymptotic analysis of multivariate goodness-of-fit tests, and the second-order behavior of rank test power and U-statistics. Several of the papers arose from questions posed by friends and colleagues. Thus, some general results on asymptotic theory for sequential procedures as the cost of observation tended to zero were prompted by Yossi Yahav. The second-order analysis of nonparametric tests grew out of a question asked by Hodges and Lehmann in their fundamental paper published in 1970 (Hodges and Lehmann, 1970). The question was picked up independently by van Zwet and myself and we then decided to make common cause. The resulting work led to the development of second-order theory for U-statistics by van Zwet, Götze and myself; see Bickel (1974) and subsequent papers. The work on curve estimation originated from a question posed by Murray Rosenblatt.

I even made an applied contribution in 1976 as a collaborator in the analysis of "Sex bias in graduate admission at Berkeley," which appeared in Science (Bickel et al., 1975). Thanks to my colleague David Freedman's brilliant textbook Statistics, it garnered more citations than all my other work. This was initiated by a question from Gene Hammel, Professor of Anthropology and then Associate Dean of the Graduate Division.

Two opportunities to work outside of UC Berkeley had a major impact on my research interests, the second more than the first, initially.


6.3.1 Imperial College

In 1965–66 I took a leave of absence from UC Berkeley and spent a year at Imperial College, London during the last year that George Barnard held the Statistics Chair. My job was mainly tutoring undergraduates in their classes in probability theory, but I became acquainted with John Copas in addition to Barnard at Imperial and with some of the other British statisticians and probabilists, including David Cox, Henry Daniels, Allan Stuart, Jim Durbin, the two very different Kendalls, David and Maurice, and others. The exposure to data analysis, after the theory-steeped Berkeley environment, was somewhat unsettling but I wonder if, at least subconsciously, it enabled me to deal sensibly with the sex bias in graduate admissions questions.

I returned to Imperial in 1975–76 during David Cox's tenure as chair. I interacted with an active group of young people including Tony Atkinson, Agnes Herzberg, Ann Mitchell, and Adrian Smith. Agnes and I initiated some work on robust experimental design but my knowledge of martingales was too minimal for me to pursue David Cox's suggestion to further explore the Cox survival analysis model. This again was an opportunity missed, but by then I was already involved in so many collaborations that I felt unready to take on an entirely new area.

6.3.2 Princeton

My second exposure to a very different environment came in 1970–71 when, on a Guggenheim Fellowship, I joined David Andrews, Peter Huber, Frank Hampel, Bill Rogers, and John Tukey during the Princeton Robustness Year. There I was exposed for the first time to a major computational simulation effort in which a large number of estimates of location were compared on a large number of possible distributions. I found to my pleased surprise that some of my asymptotic theory based ideas, in particular one-step estimates, really worked. On the other hand, I listened, but didn't pay enough attention, to Tukey. If, for instance, I had followed up on a question of his on the behavior of an iteration of a one-step estimate I had developed to obtain asymptotic analogues of linear combinations of order statistics in regression, I might have preceded at least some of the lovely work of Koenker and Bassett on quantile regression. However, unexpected questions such as adaptation came up in talking to Peter Huber and influenced my subsequent work greatly.

The moral of these stories is that it is very important to expose yourself to new work when you're young and starting out, but that their effects may not be felt for a long time if one is, as I am, basically conservative.

6.3.3 Collaboration and students

Most of my best work in the pre-1979 period was with collaborators who often deserve much more credit than I do, including Yossi Yahav, W.R. van Zwet, M. Rosenblatt, my advisor and friend, E.L. Lehmann, and his favorite collaborator who brought me into statistics, J.L. Hodges, Jr. This trend continued with additional characters throughout my career. I should note that I have an unfair advantage over these friends since, with 'B' starting my name, I am almost always listed first in the list of authors, though at that time much less fuss was made about this in the mathematical sciences than now.

I acquired PhD students rather quickly, partly for selfish reasons. I always found that I could think more clearly and quickly in conversation than in single-handedly batting my head against a brick wall. More significantly, I like to interact with different minds whose foci and manner of operation are quite different from mine and whose knowledge in various directions is broader and deeper.

Thus my knowledge of invariance principles, concentration inequalities and the like, which led to the work on distribution-free multivariate tests, came in part out of working with Hira Koul on confidence regions for multivariate location based on rank tests.

There are, of course, students who have had a profound effect on my research directions. Some of them became lifelong collaborators. I will name six in advance. Their roles will become apparent later. There are others, such as Jianqing Fan and Jeff Wu, who have played and are playing very important roles in the field but whose interests have only occasionally meshed with mine after their doctorates, though I still hope for more collaborations with them, too.

(a) Ya'acov Ritov (PhD in Statistics from Hebrew University, Jerusalem, supervised during a 1979–80 sabbatical)

(b) Elizaveta Levina (PhD in Statistics, UC Berkeley, 2002)

(c) Katerina Kechris (PhD in Statistics, UC Berkeley, 2003)

(d) Aiyou Chen (PhD in Statistics, UC Berkeley, 2004)

(e) Bo Li (PhD in Statistics, UC Berkeley, 2006)

(f) James (Ben) Brown (PhD in Applied Science and Technology, College of Engineering, UC Berkeley, 2008).

6.4 My work after 1979

In most fields the amount, types and complexity of data have increased on an unprecedented scale. This, of course, originated from the increasing impact of computers and the development of refined sensing equipment. The rise in computing capabilities also increased greatly the types of analysis we could make. This was first perceived in statistics by Brad Efron, who introduced Monte Carlo in the service of inference by inventing the bootstrap. David Freedman and I produced some of the first papers validating the use of the bootstrap in a general context.

6.4.1 Semiparametric models

In the 1980s new types of data, arising mainly from complex clinical trials, but also astronomy and econometrics, began to appear. These were called semiparametric because they needed both finite- and infinite-dimensional parameters for adequate description.

Semiparametric models had been around for some time in the form of classical location and regression models as well as in survival analysis and quality control, survey sampling, economics, and to some extent, astronomy. There had been theoretical treatments of various aspects by Ibragimov and Khasminskii, Pfanzagl, and Lucien LeCam at a high level of generality. The key idea for their analysis was due to Charles Stein. Chris Klaassen, Ya'acov Ritov, Jon Wellner and I were able to present a unified viewpoint on these models, make a key connection to robustness, and develop methods both for semiparametric performance lower bounds and actual estimation. Our work, of which I was and am still proud, was published in book form in 1993. Much development has gone on since then through the efforts of some of my coauthors and others such as Aad van der Vaart and Jamie Robins. I worked with Ritov on various aspects of semiparametrics throughout the years, and mention some of that work below.

6.4.2 Nonparametric estimation of functions

In order to achieve the semiparametric lower bounds that we derived, it became clear that restrictions had to be placed on the class of infinite dimensional "nuisance parameters." In fact, Ritov and I were able to show in a particular situation, the estimation of the integral of the square of a density, that even though the formal lower bounds could be calculated for all densities, efficient estimation of this parameter was possible if and only if the density obeyed a Lipschitz condition of order larger than 1/4, and √n estimation was possible if and only if the condition had an exponent greater than or equal to 1/4.

In the mid-'80s and '90s David Donoho, Iain Johnstone and their collaborators introduced wavelets to function estimation in statistics; see, e.g., Donoho et al. (1996). With this motivation they then exploited the Gaussian white noise model. This is a generalization of the canonical Gaussian linear model, introduced by Ibragimov and Khasminskii, in which one could quantitatively study minimax analysis of estimation in complex function spaces whose definition qualitatively mimics properties of functions encountered in the real world. Their analysis led rather naturally to regularization by thresholding, a technique which had appeared in some work of mine on procedures which could work well in the presence of bias (Bickel, 1981), although I was far from putting this work in its proper context. Their work in this area, earlier work on detecting sparse objects (Donoho et al., 1992), and earlier work of Stone (1977) made it apparent that, without much knowledge of a statistical problem, minimax bounds indicated that nothing much could be achieved in even moderate dimensions.

On the other hand, a branch of computer science, machine learning, had developed methodology such as neural nets, and on the statistics side, Leo Breiman and Jerry Friedman, working with Richard Olshen and Charles Stone, developed CART. Both of these methods of classification use very high dimensional predictors and, relative to the number of predictors, small training sets. These methods worked remarkably well, far better than the minimax theory would lead us to believe. These approaches and a plethora of other methods developed in the two communities, such as Boosting, Random Forests, and above all "lasso"-driven methods involve, implicitly or explicitly, "regularization," which pulls solutions of high dimensional optimization problems towards low dimensional spaces.

In many situations, while we know little about the problem, if we can assume that, in an appropriate representation, only a relatively few major factors matter, then theorists can hope to reconcile the "Curse of Dimensionality" minimax results with the observed success of prediction methods based on very high dimensional predictor sets.

Under the influence of Leo Breiman I became very aware of these developments and started to contribute, for instance, to the theory of boosting in Bickel et al. (2006). I particularly liked a simple observation with Bo Li, growing out of my Rietz lecture (Bickel and Li, 2007). If predictors of dimension p are assumed to lie on an unknown smooth d-dimensional manifold of R^p with d ≪ p, then the difficulty of the nonparametric regression problem is governed not by p but by d, provided that regularization is done in a suitably data-determined way; that is, bandwidth selection is done after implicit or explicit estimation of d.

6.4.3 Estimating high dimensional objects

My views on the necessary existence of low dimensional structure were greatly strengthened by working with Elizaveta Levina on her thesis. We worked with Jitendra Malik, a specialist in computer vision, and his students, first in analyzing an algorithm for texture reconstruction developed by his then student, Alexei Efros, and then in developing some algorithms for texture classification. The first problem turned out to be equivalent to a type of spatial bootstrap. The second could be viewed as a classification problem based on samples of 1000+ dimensional vectors (picture patches) where the goal was to classify the picture from which the patches were taken into one of several classes.
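As a concrete instance of the "regularization" invoked in the preceding section, the lasso estimator in the linear model with n observations and p predictors is usually written as follows; the display is a standard textbook formulation rather than one taken from the chapter.

\[
\hat{\beta}_{\text{lasso}} \;=\; \arg\min_{\beta \in \mathbb{R}^{p}}
\left\{ \frac{1}{2n} \sum_{i=1}^{n} \bigl(y_i - x_i^{\top}\beta\bigr)^{2}
\;+\; \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert \right\}.
\]

The ℓ₁ penalty is what pulls the solution towards sparse, effectively low dimensional β even when p greatly exceeds n. The Dantzig selector mentioned below minimizes the same ℓ₁ norm subject instead to a bound on the correlation between the residuals and the predictors, which is the sense in which the two procedures turn out to be closely related.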


P.J. Bickel 65We eventually developed methods which were state-of-the-art in the fieldbut the main lesson we drew was that just using patch marginal distributions(a procedure sometimes known as naive Bayes) worked better than trying toestimate joint distributions of patch pixels.The texture problem was too complex to analyze further, so we turned toasimplerprobleminwhichexplicitasymptoticcomparisonscouldbemade:classifying a new p-dimensional multivariate observation into one of two unknownGaussian populations with equal covariance matrices, on the basis of asample of n observations from each of the two populations (Bickel and Levina,2004). In this context we compared the performance of(i) Fisher’s linear discriminant function(ii) Naive Bayes: Replace the empirical covariance matrix in Fisher’s functionby the diagonal matrix of estimated variances, and proceed as usual.We found that if the means and covariance matrices range over a sparselyapproximable set and we let p increase with n, so that p/n →∞,thenFisher’srule (using the Moore–Penrose inverse) performed no better than randomguessing while naive Bayes performed well, though not optimally, as long asn −1 log p → 0.The reason for this behavior was that, with Fisher’s rule, we were unnecessarilytrying to estimate too many covariances. These results led us — Levinaand I, with coworkers (Bickel and Levina, 2008) — to study a number of methodsfor estimating covariance matrices optimally under sparse approximationassumptions. Others, such as Cai et al. (2010), established minimax boundson possible performance.At the same time as this work there was a sharp rise of activity in trying tounderstand sparsity in the linear model with many predictors, and a numberof important generalizations of the lasso were proposed and studied, suchas the group lasso and the elastic net. I had — despite appearing as firstauthor — at most a supporting part in this endeavor on a paper with Ritovand Tsybakov (Bickel et al., 2009) in which we showed the equivalence of aprocedure introduced by Candès and Tao, the Danzig selector, with the morefamiliar lasso.Throughout this period I was (and continue to be) interested in semiparametricmodels and methods. An example I was pleased to work on with mythen student, Aiyou Chen, was Independent Component Analysis, a methodologyarising in electrical engineering, which had some clear advantages overclassical PCA (Chen and Bickel, 2006). Reconciling ICA and an extensionwith sparsity and high dimension is a challenge I’m addressing with anotherstudent.A more startling and important analysis is one that is joint with Bin Yu,several students, and Noureddine El Karoui (Bean et al., 2013; El Karouiet al., 2013), whose result appears in PNAS. We essentially studied robustregression when p/n → c for some c ∈ (0, 1), and showed that, contrary to


The reason for this behavior was that, with Fisher's rule, we were unnecessarily trying to estimate too many covariances. These results led us — Levina and I, with coworkers (Bickel and Levina, 2008) — to study a number of methods for estimating covariance matrices optimally under sparse approximation assumptions. Others, such as Cai et al. (2010), established minimax bounds on possible performance.

At the same time as this work there was a sharp rise of activity in trying to understand sparsity in the linear model with many predictors, and a number of important generalizations of the lasso were proposed and studied, such as the group lasso and the elastic net. I had — despite appearing as first author — at most a supporting part in this endeavor on a paper with Ritov and Tsybakov (Bickel et al., 2009) in which we showed the equivalence of a procedure introduced by Candès and Tao, the Dantzig selector, with the more familiar lasso.

Throughout this period I was (and continue to be) interested in semiparametric models and methods. An example I was pleased to work on with my then student, Aiyou Chen, was Independent Component Analysis, a methodology arising in electrical engineering, which had some clear advantages over classical PCA (Chen and Bickel, 2006). Reconciling ICA and an extension with sparsity and high dimension is a challenge I'm addressing with another student.

A more startling and important analysis is one that is joint with Bin Yu, several students, and Noureddine El Karoui (Bean et al., 2013; El Karoui et al., 2013), whose result appears in PNAS. We essentially studied robust regression when p/n → c for some c ∈ (0, 1), and showed that, contrary to what is suggested in a famous paper of Huber (Huber, 1973), the asymptotic normality and 1/√n regime carries over, with a few very significant exceptions. However, limiting variances acquire a new Gaussian factor. As Huber discovered, contrasts with coefficients dependent on the observed predictors may behave quite differently. Most importantly, the parameters of the Gaussian limits depend intricately on the nature of the design, not simply on the covariance matrix of the predictors. This work brought out that high dimension really presents us with novel paradigms. For p very large, garden variety models exhibit strange properties. For instance, symmetric p-variate Gaussian distributions put almost all of their mass on a thin shell around the border of the sphere of radius √p that is centered at the mean. It was previously noted that for inference to be possible one would like mass to be concentrated on low dimensional structures. But finding these structures and taking the search process into account in inference poses very new challenges.

6.4.4 Networks

My latest interest started around 2008 and is quite possibly my last theoretical area of exploration. It concerns inference for networks, a type of data arising first in the social sciences, but which is now of great interest in many communities, including computer science, physics, mathematics and, last but not least, biology.

Probabilistically, this is the study of random graphs. It was initiated by Erdős and Rényi (1959). If you concentrate on unlabeled graphs, which is so far essentially the only focus of the probability community, it is possible to formulate a nonparametric framework using work of Aldous and Hoover which permits the identification of analogues of i.i.d. variables and hence provides the basis of inference with covariates for appropriate asymptotics (Bickel and Chen, 2009). Unfortunately, fitting even the simplest parametric models by maximum likelihood is an NP-hard problem. Nevertheless, a number of simple fitting methods based on spectral clustering of the adjacency or Laplacian matrices (Rohe et al., 2011; Fishkind et al., 2013), combined with other ideas, seem to work well both in theory and practice. An interesting feature of this theory is that it ties into random matrix theory, an important and very active field in probability and mathematics with links to classical Gaussian multivariate theory that were recently discovered by Iain Johnstone and his students (Johnstone, 2008).

In fact, more complex types of models and methods, all needing covariates, are already being used in an ad hoc way in many applications. So there's lots to do.
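
The spectral clustering idea mentioned above can be illustrated with a small sketch; the two-block stochastic block model, the block sizes, and the edge probabilities are illustrative choices, and the sign-of-the-second-eigenvector rule is a simplification of the methods in the cited papers.

    import numpy as np

    rng = np.random.default_rng(2)
    n_block, p_in, p_out = 200, 0.10, 0.02
    labels = np.repeat([0, 1], n_block)

    # Symmetric adjacency matrix with independent edges above the diagonal.
    P = np.where(labels[:, None] == labels[None, :], p_in, p_out)
    upper = np.triu(rng.random(P.shape) < P, k=1)
    A = (upper | upper.T).astype(float)

    # Eigenvectors of the symmetric adjacency matrix (eigh sorts eigenvalues
    # in ascending order, so the last two columns are the leading ones).
    eigvals, eigvecs = np.linalg.eigh(A)
    second = eigvecs[:, -2]
    estimate = (second > 0).astype(int)

    # Report agreement up to relabelling of the two blocks.
    agreement = max(np.mean(estimate == labels), np.mean(estimate != labels))
    print(f"fraction of nodes correctly clustered: {agreement:.3f}")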


6.4.5 Genomics

Following work initiated in 2005 with my then student, Katerina Kechris, and a colleague in molecular biology, Alex Glazer, I began to return to a high school interest, biology. After various missteps I was able to build up a group with a new colleague, Haiyan Huang, supported initially by an NSF/NIGMS grant, working on problems in molecular biology with a group at the Lawrence Berkeley Lab. Through a series of fortunate accidents our group became the only statistics group associated with a major multinational effort, the ENCODE (Encyclopedia of DNA Elements) project. The end product of this effort, apart from many papers in Nature, Science, Genome Research and the like, was a terabyte of data (Birney et al., 2007).

I was fortunate enough to acquire a student, James (Ben) Brown, from an engineering program at Berkeley, who had both an intense interest in, and knowledge of, genomics and also the critical computational issues that are an integral part of such a collaboration. Through his participation, I, and to a considerable extent, Haiyan, did not need to immerse ourselves fully in the critical experimental issues underlying a sensible data analysis. Ben could translate and pose old and new problems in terms we could understand.

The collaboration went on for more than five years, including a pilot project. During this time Ben obtained his PhD, continued as a postdoc and is now beset with job offers from computational biology groups at LBL and all over. Whether our group's participation in such large scale computational efforts can continue at the current level without the kind of connection to experimentalists provided by Ben will, I hope, not be tested, since we all wish to continue to collaborate.

There have been two clearly measurable consequences of our participation.

(a) Our citation count has risen enormously, as guaranteed by participation in high-visibility biology journals.

(b) We have developed two statistical methods, the GSC (Genome Structural Correction) and the IDR (Irreproducible Discovery Rate), which have appeared in The Annals of Applied Statistics (Bickel et al., 2010; Li et al., 2011) and, more significantly, were heavily used by the ENCODE consortium.


6.5 Some observations

One of the things that has struck me in writing this is that "old ideas never die," and they may not fade away. Although I have divided my interests into coherent successive stages, in fact, different ideas frequently reappeared.

For instance, second-order theory and early papers in Bayes procedures combined in a paper with J.K. Ghosh in 1990 (Bickel and Ghosh, 1990), which gave what I still view as a neat analysis of a well-known phenomenon called the Bartlett correction. Theoretical work on the behavior of inference in Hidden Markov Models with Ritov and Rydén (Bickel et al., 1998) led to a study which showed how difficult implementation of particle filters is in the atmospheric sciences (Snyder et al., 2008). The work in HMM came up again in the context of traffic forecasting (Bickel et al., 2007) and some work in astrophysics (Meinshausen et al., 2009). Both papers were close collaborations with John Rice, and the second included Nicolai Meinshausen as well. The early bootstrap work with Freedman eventually morphed into work on the m out of n bootstrap with Götze and van Zwet (Bickel et al., 1997) and finally into the Genome Structural Correction method (Bickel et al., 2010).

Another quite unrelated observation is that to succeed in applications one has to work closely with respected practitioners in the field. The main reason for this is that otherwise, statistical (and other mathematical science) contributions are dismissed because they miss what practitioners know is the essential difficulty. A more pedestrian reason is that without the imprimatur and ability to translate of a respected scientist in the field of application, statistical papers will not be accepted in the major journals of the science and hence will be ignored.

Another observation is that high-order computing skills are necessary to work successfully with scientists on big data. From a theoretical point of view, the utility of procedures requires not only their statistical, but to an equal extent, their computational efficiency. Performance has to be judged through simulations as well as asymptotic approximations.

I freely confess that I have not subscribed to the principle of honing my own computational skills. As a member of an older generation, I rely on younger students and collaborators for help with this. But for people starting their careers it is essential. The greater the facility with computing, in addition to R and including Matlab, C++, Python or their future versions, the better you will succeed as a statistician in most directions.

As I noted before, successful collaboration requires the ability to really understand the issues the scientist faces. This can certainly be facilitated by direct study in the field of application.

And then, at least in my own career, I've found the more mathematics I knew, from probability to functional analysis to discrete mathematics, the better. And it would have been very useful to have learned more information theory, statistical physics, etc., etc.

Of course I'm describing learning beyond what can be done or is desirable in a lifetime. (Perhaps with the exception of John von Neumann!) We all specialize in some way. But I think it's important to keep in mind that statistics should be viewed as broadly as possible and that we should glory in this time when statistical thinking pervades almost every field of endeavor. It is really a lot of fun.


References

Bean, D., Bickel, P.J., El Karoui, N., and Yu, B. (2013). Optimal M-estimation in high-dimensional regression. Proceedings of the National Academy of Sciences, 110:14563–14568.

Bickel, P.J. (1974). Edgeworth expansions in nonparametric statistics. The Annals of Statistics, 2:1–20.

Bickel, P.J. (1981). Minimax estimation of the mean of a normal distribution when the parameter space is restricted. The Annals of Statistics, 9:1301–1309.

Bickel, P.J., Boley, N., Brown, J.B., Huang, H., and Hang, N.R. (2010). Subsampling methods for genomic inference. The Annals of Applied Statistics, 4:1660–1697.

Bickel, P.J. and Chen, A. (2009). A nonparametric view of network models and Newman–Girvan and other modularities. Proceedings of the National Academy of Sciences, 106:21068–21073.

Bickel, P.J., Chen, C., Kwon, J., Rice, J., van Zwet, E., and Varaiya, P. (2007). Measuring traffic. Statistical Science, 22:581–597.

Bickel, P.J. and Ghosh, J.K. (1990). A decomposition for the likelihood ratio statistic and the Bartlett correction — a Bayesian argument. The Annals of Statistics, 18:1070–1090.

Bickel, P.J., Götze, F., and van Zwet, W.R. (1997). Resampling fewer than n observations: Gains, losses, and remedies for losses. Statistica Sinica, 7:1–31.

Bickel, P.J., Hammel, E.A., and O'Connell, J.W. (1975). Sex bias in graduate admissions: Data from Berkeley. Science, 187:398–404.

Bickel, P.J. and Levina, E. (2004). Some theory of Fisher's linear discriminant function, 'naive Bayes', and some alternatives when there are many more variables than observations. Bernoulli, 10:989–1010.

Bickel, P.J. and Levina, E. (2008). Regularized estimation of large covariance matrices. The Annals of Statistics, 36:199–227.

Bickel, P.J. and Li, B. (2007). Local polynomial regression on unknown manifolds. In Complex Datasets and Inverse Problems, IMS Lecture Notes vol. 54, Institute of Mathematical Statistics, Beachwood, OH, pp. 177–186.

Bickel, P.J., Ritov, Y., and Rydén, T. (1998). Asymptotic normality of the maximum-likelihood estimator for general hidden Markov models. The Annals of Statistics, 26:1614–1635.


Bickel, P.J., Ritov, Y., and Tsybakov, A.B. (2009). Simultaneous analysis of lasso and Dantzig selector. The Annals of Statistics, 37:1705–1732.

Bickel, P.J., Ritov, Y., and Zakai, A. (2006). Some theory for generalized boosting algorithms. Journal of Machine Learning Research, 7:705–732.

Birney, E., Stamatoyannopoulos, J.A., Dutta, A., Guigó, R., Gingeras, T.R., Margulies, E.H., Weng, Z., Snyder, M., Dermitzakis, E.T. et al. (2007). Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature, 447:799–816.

Cai, T.T., Zhang, C.-H., and Zhou, H.H. (2010). Optimal rates of convergence for covariance matrix estimation. The Annals of Statistics, 38:2118–2144.

Chen, A. and Bickel, P.J. (2006). Efficient independent component analysis. The Annals of Statistics, 34:2825–2855.

Donoho, D.L., Johnstone, I.M., Hoch, J.C., and Stern, A.S. (1992). Maximum entropy and the nearly black object (with discussion). Journal of the Royal Statistical Society, Series B, 54:41–81.

Donoho, D.L., Johnstone, I.M., Kerkyacharian, G., and Picard, D. (1996). Density estimation by wavelet thresholding. The Annals of Statistics, 24:508–539.

El Karoui, N., Bean, D., Bickel, P.J., Lim, C., and Yu, B. (2013). On robust regression with high-dimensional predictors. Proceedings of the National Academy of Sciences, 110:14557–14562.

Erdős, P. and Rényi, A. (1959). On random graphs. I. Publicationes Mathematicae Debrecen, 6:290–297.

Fishkind, D.E., Sussman, D.L., Tang, M., Vogelstein, J.T., and Priebe, C.E. (2013). Consistent adjacency-spectral partitioning for the stochastic block model when the model parameters are unknown. SIAM Journal on Matrix Analysis and Applications, 34:23–39.

Hodges, J.L., Jr. and Lehmann, E.L. (1970). Deficiency. The Annals of Mathematical Statistics, 41:783–801.

Huber, P.J. (1973). Robust regression: Asymptotics, conjectures and Monte Carlo. The Annals of Statistics, 1:799–821.

Johnstone, I.M. (2008). Multivariate analysis and Jacobi ensembles: Largest eigenvalue, Tracy–Widom limits and rates of convergence. The Annals of Statistics, 36:2638–2716.

Li, Q., Brown, J.B., Huang, H., and Bickel, P.J. (2011). Measuring reproducibility of high-throughput experiments. The Annals of Applied Statistics, 5:1752–1779.


Meinshausen, N., Bickel, P.J., and Rice, J. (2009). Efficient blind search: Optimal power of detection under computational cost constraints. The Annals of Applied Statistics, 3:38–60.

Rohe, K., Chatterjee, S., and Yu, B. (2011). Spectral clustering and the high-dimensional stochastic blockmodel. The Annals of Statistics, 39:1878–1915.

Snyder, C., Bengtsson, T., Bickel, P.J., and Anderson, J. (2008). Obstacles to high-dimensional particle filtering. Monthly Weather Review, 136:4629–4640.

Stone, C.J. (1977). Consistent nonparametric regression (with discussion). The Annals of Statistics, 5:595–645.


7
The accidental biostatistics professor

Donna J. Brogan
Department of Biostatistics and Bioinformatics
Emory University, Atlanta, GA

Several chapters in this book summarize the authors' career paths after completion of graduate school. My chapter includes significant childhood and early adult experiences that coalesced into eventually completing a PhD degree in statistics. I also summarize some highlights of my biostatistics academic career post PhD at two universities over 37 years. My educational and career paths had twists and turns and were not planned in advance, but an underlying theme throughout was my strong interest and ability in mathematics and statistics.

7.1 Public school and passion for mathematics

I grew up in a working class neighborhood in Baltimore and loved school from the moment I entered. I was a dedicated and conscientious student and received encouragement from many teachers. My lifelong interest in math and my perseverance trait developed at an early age, as the following vignettes illustrate.

As a nine year old in 1948 I rode the public bus each month to a local bank, clutching tightly in my fist $50 in cash and a passbook, in order to make the mortgage payment for a row house in which I lived. I asked the tellers questions over time about the passbook entries. Once I grasped some basic ideas, I did some calculations and then asked why the mortgage balance was not reduced each month by the amount of the mortgage payment. A teller explained to me about interest on loans of money; it sounded quite unfair to me.

Two years later my sixth grade teacher, Mr. Loughran, noted my mathematical ability and, after school hours, taught me junior high and high school mathematics, which I loved. He recommended me for admission to the only accelerated junior high school in Baltimore, where I completed in two years the work for 7th, 8th, and 9th grades.


My maternal grandfather, whose public education ended after 8th grade, fanned my passion for math and analysis by showing me math puzzles and tricks, how to calculate baseball statistics like RBI, and how to play checkers and chess, in which he was a local champion.

I chose the unpopular academic track in my inner city working class high school simply because it offered the most advanced math courses. I had no plans to go to college.

I decided that typing would be a useful skill but was denied enrollment because I was not in the commercial track. However, I persisted and was enrolled. When personal computers appeared a few decades later, I was fast and accurate on the keyboard, unlike most of my male academic colleagues.

A gifted math teacher in 11th and 12th grades, Ms. Reese, gave a 10-minute drill (mini-test) to students at the beginning of each daily class. She encouraged my math interest and challenged me daily with a different and more difficult drill, unknown to other students in the class.

A female high school counselor, Dr. Speer, strongly advised me to go to college, a path taken by few graduates of my high school and no one in my immediate family. I applied to three schools. I won substantial scholarships to two state schools (University of Maryland and Western Maryland) but chose to attend Gettysburg College, in Pennsylvania, with less financial aid, because it was smaller and seemed less intimidating to me.

7.2 College years and discovery of statistics

College was my first exposure to middle class America. I majored in math and planned to be a high school math teacher of the caliber of Ms. Reese. However, I rashly discarded this goal in my sophomore year after intensely disliking my first required education course. In my junior year I became aware of statistics via two math courses: probability and applied business statistics. However, two courses during my senior year solidified my lifelong interest in statistics: mathematical statistics and abnormal psychology.

A new two-semester math-stat course was taught by a reluctant Dr. Fryling, the only math (or college) faculty member who had studied statistical theory. He commented frequently that he felt unqualified to teach the course, but I thought he did a great job and I was wildly excited about the topic. I worked all assigned problems in the textbook and additional ones out of general interest. When midterm exam time approached, Dr. Fryling stated that he did not know how to construct an exam for the course. Without thinking, and not yet having learned the social mores of college life, my hand shot up, and I said that I could construct a good exam for the course. The other students noticeably groaned. After class Dr. Fryling discussed with me my unorthodox suggestion and took me up on my offer.


After reviewing my prepared exam and answer key, he accepted it. I did not take the exam, of course, and he graded the students' answers. We continued this arrangement for the rest of the year.

In my abnormal psychology course that same year we were assigned to read in the library selected pages from the Kinsey books on human sexual behavior (Kinsey et al., 1948, 1953). The assigned readings were not all that interesting, but I avidly read in each book the unassigned methods chapter that discussed, among other things, statistical analysis strategy and sampling issues (i.e., the difficulty of obtaining a representative sample of people who were willing to answer sensitive questions about their sexual behavior). This was my first exposure to sampling theory applications, which eventually evolved into my statistical specialty. Sometime later I read a critique of the 1948 Kinsey book methodology by Cochran et al. (1954); this book is a real education in sampling and data analysis, and I highly recommend it.

Dr. Fryling, noting my blossoming fascination with statistics, asked me about my career plans. I had none, since giving up secondary school teaching, but mentioned physician and actuary as two possibilities, based on my science and statistics interests. He advised that a medical career was too hard for a woman to manage with family life and that the actuarial science field was not friendly to women. I accepted his statements without question. Neither one of us had a strong (or any) feminist perspective at the time; the second wave of feminism in the United States was still ten years into the future.

A fellow male math major had suggested that I go into engineering since I was good at math. I did not know what engineering was and did not investigate it further; I thought an engineer was the person who drove the train. Even though I was passionate about math and statistics and performed well in them, there were obvious gaps in my general education and knowledge; some family members and friends say this is still true today.

Dr. Fryling strongly recommended that I apply for a Woodrow Wilson National Fellowship, with his nomination, and pursue a doctoral degree in statistics or math. These competitive fellowships were prestigious and provided full graduate school funding for persons who planned a college teaching career. I had not considered such a career, nor did it appeal to me, perhaps because I never saw a female faculty member at Gettysburg College except for girls' physical education. Although Dr. Fryling indicated that I would not be legally bound to teach college by accepting a Wilson fellowship, I felt that it would not be appropriate to apply when I had no intention of becoming a college teacher. Other Wilson applicants may not have been so scrupulous about "the rules." Looking back now, the Wilson fellowship was excellent advice, but limited self-awareness of my own talents and interests prevented me from taking this opportunity.


7.3 Thwarted employment search after college

Having discarded high school and college teaching, actuarial science, and medicine, I sought employment after college graduation in 1960. I was aware of only two methods to find a job: look in the newspapers' "Help Wanted" sections and talk with employers at job fairs on campus.

The newspaper route proved fruitless. Younger readers may not be aware that newspapers had separate "Help Wanted Female" and "Help Wanted Male" sections until the late 1960s or early 1970s, when such practice eventually was ruled to be illegal sex discrimination. In 1960 advertised positions using math skills and interest were in "Help Wanted Male," and I assumed that it would be futile to apply. Job interviews on campus with employers played out similarly; all positions were segregated by gender and all technical positions were for males. One vignette, among many, illustrates the employment culture for women in the US in 1960.

When I registered for an interview on campus with IBM, I was required to take a math aptitude test. The IBM interviewer commented that he had never seen such a high score from any applicant and offered me either a secretarial or an entry sales position. I countered that I was interested in their advertised technical positions that required a math background, especially given my score on their math aptitude test, but he simply said that those positions were for males. End of conversation.

7.4 Graduate school as a fallback option

Although it is hard for me to believe now, I did not view my failed employment search in 1960 to be the result of systematic societal sex discrimination against women in employment. Rather, I concluded that if I were more qualified, I would be hired even though I was female.

Thus, I decided to pursue a Master's degree in statistics. Looking back, my search for graduate schools seems naive. I scanned available college catalogs at the Gettysburg College library, identified schools that had a separate statistics department, and applied to three that somehow appealed to me. All three accepted me and offered financial aid: University of Chicago, Columbia University, and Purdue University. I chose Purdue because it was the least expensive for me after credit from financial aid.


7.5 Master's degree in statistics at Purdue

Upon arriving at the Purdue Statistics Department in fall of 1960, three new graduate students (including me) chose the M.S. applied statistics track while the many remaining new stat graduate students chose the mathematical statistics track. After one semester I noted that more than half of the math stat track students switched to the applied statistics track. I had a fleeting concern that I might have chosen a "flunky" track, but I loved it and continued on. At the end of first semester I married a Purdue graduate student in English.

As part of my financial aid, I assisted in teaching undergraduate calculus courses and gained valuable instructor experience which supplemented my extensive math tutoring experience begun as an undergraduate. I began to think that teaching college might not be a bad idea after all.

7.6 Thwarted employment search after Master's degree

After my husband and I completed our Purdue Master's degrees in 1962, we moved to Ames, Iowa, where he began a faculty position in the English Department at Iowa State University (ISU). I visited the ISU Statistics Department to inquire about employment opportunities and was offered a technical typist position. My interviewer was enthusiastic because I would understand many formulas and thus make fewer typing errors. Upon inquiring about positions using my statistical skills, I was told that no statistical staff positions were available. Ames was a small town, so I searched in Des Moines, about 35 miles away. I was able to find only clerical or secretarial positions; all technical positions were reserved for males.

7.7 Graduate school again as a fallback option

Since I was living in Ames, home to one of the best statistics departments in the country, I decided to take additional courses to become more qualified for a statistical position. I was not allowed to take ISU courses unless I was a degree seeking student; thus I applied for the statistics doctoral program. Not only was I accepted by the same department that had offered me a statistical typist position, but I was awarded a prestigious and competitive university-wide doctoral fellowship for one year that paid all expenses and an attractive stipend. My daughter Jennifer was born at the end of my first year at ISU.


For my second and subsequent years at ISU the department appointed me to a National Institutes of Health (NIH) biostatistics traineeship that paid all expenses and an attractive stipend. I had never heard the word biostatistics.

I especially enjoyed my ISU sampling courses, building upon my initial interest in this topic from the Kinsey et al. (1948, 1953) reports and the Cochran et al. (1954) critique. Most of the other doctoral students disliked sampling: boring topic and too many formulas. I found sampling fascinating, but I frequently have been known for being out of the mainstream.

During the summer following my second ISU year my traineeship paid for me to take courses in biostatistics and epidemiology at the School of Public Health at the University of North Carolina at Chapel Hill, since ISU did not offer these courses. I began to understand the scope of biostatistics. The application of statistical theory and methods to public health and medicine appealed to me, combining my then current interests in statistics and psychology with my earlier interest in medicine.

Now that I had taken all of the required coursework for a statistics doctoral degree, fulfilling my limited objective of learning more about statistics to become more employable, I decided to take the scheduled doctoral exams. If I did well, I would continue on to finish the work for a PhD, i.e., write a dissertation. To my surprise, I received the George Snedecor Award for the most outstanding PhD candidate that year, based on doctoral exam performance, and shared the award with another student because we were tied.

7.8 Dissertation research and family issues

Completing my dissertation took longer than anticipated due to academic and family issues. I began my dissertation research a few months after my doctoral exams and the birth of my son Jeffrey. Unfortunately, he was diagnosed with stomach cancer shortly thereafter and had a limited life expectancy. After one year's work on a dissertation topic that had been chosen for me, I discarded my limited research results, feeling that I was not a good match for the topic or for the dissertation advisor. I took a six-month leave of absence from graduate school to spend more time with my two children.

Upon returning to school I requested, and was granted, permission by the department to change my dissertation advisor and topic, an unusual occurrence. I felt this strategy was the only way I would ever finish my degree. I began working with Dr. Joseph Sedransk on a sampling problem of interest to me and, with his expert guidance and assistance, completed my dissertation in a little over one year, in summer of 1967. My son died during the middle of this dissertation work, a few days before his second birthday.

This clearly was a difficult time period for me and my family, and I appreciate very much the support given to me by the ISU Department of Statistics.


7.9 Job offers — finally!

I planned to move to Chapel Hill after finishing my PhD because my husband had been accepted at the University of North Carolina (UNC) as a linguistics doctoral student for fall of 1967. Early that year I contacted the UNC Biostatistics Department and the Duke University Medical Center to inquire about available positions. Surprisingly, each school invited me for an interview. I visited both schools and gave a seminar about my dissertation results to date.

Within the next several weeks I received from each school an attractive offer of a tenure-track Assistant Professor position. After much deliberation I chose UNC, primarily because of an interesting and unique opportunity there. Dr. Bernard Greenberg, Biostatistics Chair, offered to appoint me as the director of an already funded training grant in the department from the National Institute of Mental Health (NIMH) to develop and implement an MSPH program in mental health statistics. This offer combined my interests in statistics and psychology; I had minored in psychology for both graduate degrees, including some psychometrics during my doctoral studies.

7.10 Four years at UNC-Chapel Hill

Upon arrival at UNC another newly hired Assistant Professor and I met with a human resources specialist to review our fringe benefits. At the end of the meeting, the specialist informed the other faculty member that he would receive an attractive disability insurance policy paid for by UNC. When I inquired if I would receive this fringe benefit, the male specialist answered no, explaining that women don't need disability insurance since their husbands take care of them financially. It did not matter to him, or the university, that I was the wage earner for the family since my husband was a full-time graduate student.

Finally I began to recognize these frequent occurrences as sex discrimination. I joined a women's liberation group in Chapel Hill, and my feminist consciousness was raised indeed. I became an activist on women's barriers to employment and education, primarily within the American Statistical Association (ASA) but also at UNC. With others I founded the Caucus for Women in Statistics in 1971 and served as its president for the first three years. Concurrently I spearheaded the formation of the ASA Committee on Women in Statistics (COWIS) and served as a member in its early days. These and later actions were the basis for my receiving the COPSS Elizabeth Scott Award in 1994.


At UNC I worked with collaborators in mental health and psychiatry to develop, implement and administer the MSPH training program in mental health statistics, and I created and taught three new courses in this track (Brogan and Greenberg, 1973). Looking back, it seems to have been an unusual responsibility to be given to a brand new assistant professor just one month after PhD completion. However, Dr. Greenberg and my mental health colleagues seemed confident that I could do it, and I enjoyed the challenge of creating a new MSPH track.

During my fourth year at UNC, I wrote a grant application to NIMH to continue the MSPH program in mental health statistics but also to extend it to a doctoral level training program. NIMH funded this training grant, and my salary support within the department was covered for another five years.

However, I had a few concerns about what appeared to be the opportunity for a potentially stellar academic future. First, my departmental teaching was restricted to the specialized mental health statistics courses that I created, since I was the only faculty person who could (or would) teach them. I felt that I wanted more variety in my teaching. Second, although the extension of the training program to the doctoral level was a fantastic opportunity to develop further the niche into which I had fortuitously fallen, and hopefully to make substantial and needed contributions therein, I began to feel that I was in a niche. For some reason I did not like the feeling of being so specialized. Finally, I had tired of living in small college towns for the past 15 years and was interested in relocating to a metropolitan area, especially since my husband and I had recently divorced.

7.11 Thirty-three years at Emory University

In what might have seemed to be irrational behavior to some of my UNC Biostatistics colleagues, I accepted a position in fall of 1971 at the Emory University School of Medicine in the small and fledgling Department of Statistics and Biometry, as its first ever female faculty member. Emory transformed itself over subsequent decades into a world destination university, including the formation in 1990 of the Rollins School of Public Health (RSPH), currently one of the top-tier public health schools in the country. I was one of only a few female faculty members in RSPH upon its formation and the only female Full Professor.

At Emory I had ample opportunity to be a biostatistical generalist by conducting collaborative research with physicians and other health researchers in different disciplines. My collaborative style was involvement with almost all aspects of the research project rather than only the purely biostatistical components, primarily because I was interested in the integrity of the data that I would be analyzing later.


I enjoyed working with a few collaborators over extended time periods, including a medical sociologist colleague for thirty years.

In the early 1980s, I used my sample survey skills in an NHLBI funded multi-site collaborative contract where I designed and implemented area probability samples of adults in Georgia in order to estimate hypertension-related parameters. I learned several nitty-gritty applied sampling techniques not in textbooks from the sample survey statisticians at the other sites and first used the SUDAAN software for analysis of complex survey data.

In the mid 1980s, I was diagnosed with breast cancer. My personal experience and my biostatistics background combined to make me a useful contributor to the founding group of the breast cancer advocacy movement, culminating in the formation of the National Breast Cancer Coalition (NBCC) and similar organizations.

I served as Biostatistics chair in RSPH in the early 1990s, the first ever female chair of the school. The so-called power of the position (money and space, primarily) did not interest me. Rather, I attempted to maintain a collegial department that was successful in the typical academic arenas and was supportive of each of its members (faculty, students, and staff). My best training for the chair position was a few years that I had spent in group therapy in earlier decades (another way of saying that my training was minimal). After three years I resigned as chair because academic administration took me away from what I really loved: being a practicing biostatistician.

During the early 1990s, I began to teach continuing education workshops on analysis of complex survey data at summer programs in biostatistics and epidemiology (e.g., University of Michigan), at government agencies such as the CDC, and at annual meetings of health researchers. I continue this teaching today, even after retirement, because I enjoy it. To date I have taught about 130 of these workshops to over 3000 participants.

Upon my retirement from Emory in 2004 the Biostatistics Department and the RSPH sponsored a gala celebration with 140 guests, an exquisite sit-down dinner, and a program with many speakers who reviewed aspects of my professional life. I felt quite honored and much loved.

7.12 Summing up and acknowledgements

I enjoyed immensely my unintended academic career in biostatistics and highly recommend the discipline to those who are interested and qualified. I liked the diverse areas in which I worked as a biostatistical collaborator, in essence acquiring a mini medical education. I found teaching for very different audiences to be great fun: graduate students in biostatistics and the health sciences, health professionals, and health researchers. It took a while to find my enjoyable statistical niche: sample survey statistician.


I was able to combine some major aspects of my personal life, feminism and breast cancer history, with collaborative research and activism. I regret having had less enthusiasm for biostatistical methodological research and was not as productive in this area as I would have liked.

I am grateful to many people and institutions for helping me to prepare for and navigate my career, some mentioned above. There are too many people to mention individually here, but one must be recognized. I am indebted to my ex-husband Dr. Charles Ruhl for his strong support and encouragement of my educational and career goals; for his crucial role in our family life, especially during Jeffrey's illness; for living in Iowa longer than he wanted so that I could finish my PhD degree; for encouraging me to join a women's liberation group; and for being a feminist long before I knew what the word meant.

References

Brogan, D.R. and Greenberg, B.G. (1973). An educational program in mental health statistics. Community Mental Health Journal, 9:68–78.

Cochran, W.G., Mosteller, F., and Tukey, J.W. (1954). Statistical Problems of the Kinsey Report on Sexual Behavior in the Human Male. American Statistical Association, Washington, DC.

Kinsey, A.C., Pomeroy, W.B., and Martin, C.E. (1948). Sexual Behavior in the Human Male. W.B. Saunders Company, Philadelphia, PA.

Kinsey, A.C., Pomeroy, W.B., Martin, C.E., and Gebhard, P.H. (1953). Sexual Behavior in the Human Female. W.B. Saunders Company, Philadelphia, PA.


8
Developing a passion for statistics

Bruce G. Lindsay
Department of Statistics
Pennsylvania State University, University Park, PA

This chapter covers the major milestones of the early years of my career in statistics. It is really the story of the transitions I made, from early uncertainty about my choice of career and my level of talent, up to crossing the tenure line and realizing that I had not only been deemed a success, I had a passion for the subject. The focus will be on aspects of those adventures that seem most relevant to young people who are partway along the same journey.

8.1 Introduction

I have had a long career in the academic world of statistics, something like 40 years. I have seen the whole process from many points of view, including eight years as a department head. I have supervised 30 PhD students. I would hope that from all that experience I might have forged something worthwhile to say here, something not found in a research paper. I have chosen to focus on the early part of my career, as those are the days of major transitions.

For those of you early in your career, there are many choices to make as you navigate this world. It starts with the choice of statistics for your education, then a graduate school, then an advisor, then a topic for the thesis, then a place of employment. This is done while clearing a series of hurdles meant to separate the qualified from the unqualified, starting with entrance exams and ending with tenure. In this essay I will review some of these critical moments in my early career. With each of these milestones, I gained some wisdom about statistics and myself, and went from being an unsure young man to being a passionate scholar of statistics.

Since I joined Penn State in 1979 I have been paid to do cutting edge research that makes my university famous. I therefore have welcomed this rare opportunity to look backward instead of forward, and think about the roots of my career.


One aspect of academic life that has been frustrating to me is its ruthless vitality, always rushing forward, often ending up looking like a garden sadly in need of weeding. I wish there were more reflection, more respect for the past. The intellectual rewards, however, have always been largest for creativity, for those who till new soil, and so that is where most of the energy is spent.

And to be fair, I too deserve criticism, as I have too rarely taken on the role of oversight, the role of putting an order to what is important, and saying why. One of my passions is for discovering something new. It is like being Christopher Columbus, discovering a New World. My discoveries have sometimes involved basic understanding of scientific phenomena, but the big magic for me comes from the beautiful way that statistics, through mathematics, can find the signals in the midst of noise. In whatever way I can add something to this, by discovery of new ways to build models, or compute statistics, or generate mathematical understanding of scientific questions: that is part of what makes me feel valuable.

However, I also have a passion for mentoring young people. After all, why else 30 PhD students? I therefore take on the career counselor role here. Before I describe my early career-changing events, let me touch on a couple of personal perspectives.

There is an important element of philosophy to statistics, epitomized by the frequentist/Bayesian schism. I was fortunate to be trained by a pair of powerful statistical thinkers: Norman Breslow and David Cox. Their frequentist thinking definitely colors my perspective on the philosophy of statistics to this day. However, I am not passionate about the distinction between Bayes and frequency. Although I am interested in the basic logic behind statistics, it will be a small part of my essay. This will be more about the process of becoming excited about the entire statistics culture, and what to do with that excitement.

I am also someone who has learned to collaborate, and loves it. It is a key part of maintaining my passion. For the first seven or so years of my career, I only wrote solo papers. About 1985 or so, though, I had an eye-opening research discussion with my colleague Clifford Clogg. We were both young men then, about at the point of tenure. He had a joint appointment in the Departments of Statistics and Sociology. In our discussion we realized that we, from very different backgrounds and points of view, had just found the exact same result by two completely different means (Lindsay et al., 1991). His was computational, mine was geometric. He concluded our meeting by saying, with wonder, "I can't believe that I get paid to do this!" I wholeheartedly agreed with him, but I must say that a lot of my joy came because I was doing it with him. We became fast friends and collaborators.

Sad to say, Cliff died in 1995, at the age of 45, from a sudden heart attack. It was working with him that I first learned, in a deep sense, that the biggest joys in statistical work are those that are shared. Nowadays, I often think about problem solving alone, but I very rarely work alone.


8.2 The first statistical seeds

So now let me set the scene for my first contacts with statistics. I started out with a mathematics degree from the University of Oregon in 1969, but I had virtually no statistics training there. I then went to Yale graduate school in mathematics. I only lasted one year because I was drafted into the US military in 1970. I had no probability or statistics at Yale, rather a shame given the eminent faculty members there.

I took my first basic statistics course while in the US Coast Guard, about 1972. It was a night course at Berkeley, taught by an adjunct. Frankly, like many other "Stat 100's," it was not very inspiring. My impression from it was that statistics was a collection of strange recipes that had been generated by a foreign culture. Surely this was not mathematics, but what was it?!

On the plus side, however, I did my first real statistical analysis during those Coast Guard years. I had done poorly in an Armed Services exam that the military used to screen applicants to their training schools. The particular exam involved repeatedly looking at two long numbers side by side and saying whether they were identical or not. (I suspect my poor performance on that exam is now reflected in my poor memory of phone numbers.)

As a result of my exam results, I had to get a waiver to get into Yeoman School. Yeomen are the clerk typists of the Navy and Coast Guard. In the end I did very well in the school, and was convinced that the screening exam was worthless. And I knew that I would need statistics to prove it! My opportunity arose because I had been assigned to the same Yeoman School as its secretary. I analyzed the school's data to show that there was zero correlation between the screening exam result and performance in the school. However, my letter to the Coast Guard Commandant was never answered.

I must confess that at this time I was still a long way from being a fan of statistics. It seemed like a messy version of mathematics constructed from a variety of disconnected black boxes. I could calculate a correlation, and look up a significance level, but why? The fact that there were multiple ways to measure correlation only made it less satisfactory. But the seeds of change had been planted in me.

8.3 Graduate training

By the time I left the Coast Guard in 1974, I had decided to drop out of Yale and out of pure mathematics. I wanted something closer to life on this planet. This led me to switch to graduate school at the University of Washington (UW). It was an excellent choice.


Their Biomathematics degree offered lots of room for me to explore my applied mathematical interests. I took courses in ecology, fisheries, epidemiology, and population genetics as I looked about for interesting applications and interesting applied mathematical areas. Certainly I have no regrets about my second choice of graduate schools. It had highly talented faculty and a broad range of possibilities. I now always recommend that prospective graduate students select schools based on these characteristics.

As I took the required Biomathematics courses, I began to find statistics, and its applications, more deeply interesting. In particular, Bob Smythe's mathematical statistics course, where I first saw the magic of maximum likelihood, had me intrigued. And the more courses I took, the more I liked the subject. It is not a cohesive subject, but it is a powerful one.

A very important part of my education was not conventional coursework. I was in a consulting class in the Center for Quantitative Science unit, which meant sitting in a room and seeing clients. My very positive experience there has left me a proponent of graduate consulting classes all my life.

One of my clients brought in a fisheries problem that did not seem to fit any of the traditional models we had learned in applied classes. If it was not regression or ANOVA or discrete data, what could it be?

Salmon are fish with a complex life cycle. It starts when they are born in a home river, but they soon leave this river to mature in the open ocean. They return to their home river at the end of their lives in order to lay and fertilize eggs and die. The set of fish coming from a single river are thus a distinct subpopulation with its own genetic identity. This was important in fisheries management, as many of the fish were caught in the open ocean, but with some diagnostic measurements, one could learn about the river of origin.

In the problem I was asked to consult upon, the salmon were being caught in the Puget Sound, a waterway that connects to both American and Canadian rivers. Since each country managed its own stock of fish, the fisheries managers wanted to know how many of the caught fish came from "Canadian" rivers and how many from "American" ones.

The data consisted of electrophoretic measurements, an early form of DNA analysis, made on a sample of fish. It was important that the salmon from various rivers were physically mixed together in this sample. It was also important that the scientists had previously determined the genetic profile of salmon from each river system. However, these genetic "fingerprints" did not provide a 100% correct diagnosis of the river that each fish came from. That would have made the problem a simple one of decoding the identities, and labelling the salmon, creating a multinomial problem.

I now know that a very natural way to analyze such data is to build an appropriate mixture model, and then use maximum likelihood or a Bayesian solution. At the time, having never seen a mixture model in my coursework, I was quite clueless about what to do.


However, I did my due diligence as a consultant, talked to a population geneticist, Joseph Felsenstein, and found a relevant article in the wider genetics literature — it had a similar structure. In the so-called admixture problem, the goal was to determine the racial components of a mixed human population; see, e.g., Wang (2003).

This whole process was a great discovery for me. I found the ability of statistics to ferret out the hidden information (the home rivers of each salmon) to be the most fascinating thing I had seen to date. It was done with the magic of likelihood. The fact that I could find the methods on my own, in the literature, and make some sense of them, also gave me some confidence. My lesson was learned: statistics could be empowering. And I had ignited a passion.

In retrospect, I was also in the right place at the right time. As I now look back at the fisheries literature, I see that the scientific team that approached me with this problem was doing very cutting edge research. The decade of the 80s saw considerable development of maximum likelihood methods for unscrambling mixed populations. Indeed, I later collaborated on a paper that identified in detail the nature of the maximum likelihood solutions, as well as the identifiability issues involved (Roeder et al., 1989). In this case, the application was a different biological problem involving plants and the fertility of male plants as it depended on the distance from the female plants. In fact, there are many interesting applications of this model.

A consultant needs to do more than identify a model; he or she also needs to provide an algorithm for computation. In providing a solution to this problem, I also first discovered the EM algorithm in the literature. Mind you, there existed no algorithm called the "EM" until 1977, a year or two after this project (Dempster et al., 1977). But like many other discoveries in statistics, there were many prequels. The version I provided to the client was called the "gene-counting" algorithm, but the central idea of the EM, filling in missing data by expectation, was already there (Ott, 1977).

This algorithm became its own source of fascination to me. How and why did it work? Since that period the EM algorithm has become a powerful tool for unlocking hidden structures in many areas of statistics, and I was fortunate to be an early user, advocate, and researcher. Its key feature is its reliability in complex settings, situations where other methods are likely to fail. Whenever I teach a mixture models course, one of my first homework assignments is for the student to understand and program the EM algorithm.
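
In that spirit, here is a minimal sketch in Python of such an assignment; the two Gaussian components standing in for the river profiles, and all sample sizes, are illustrative, and only the mixing proportions are treated as unknown, which is the structure of the salmon problem described above.

    import numpy as np

    rng = np.random.default_rng(4)
    true_props = np.array([0.7, 0.3])            # e.g., "American" vs "Canadian"
    means, sds = np.array([0.0, 2.0]), np.array([1.0, 1.0])

    # Simulated mixed catch: each fish comes from one unobserved component.
    z = rng.choice(2, size=1000, p=true_props)
    x = rng.normal(means[z], sds[z])

    # Density of each observation under each known component profile.
    dens = np.exp(-0.5 * ((x[:, None] - means) / sds) ** 2) / (sds * np.sqrt(2 * np.pi))

    props = np.array([0.5, 0.5])                 # starting value
    for _ in range(200):
        # E-step: posterior probability that each fish belongs to each component.
        weighted = dens * props
        resp = weighted / weighted.sum(axis=1, keepdims=True)
        # M-step: new proportions are the average responsibilities (the fish
        # are "counted" fractionally, as in the gene-counting algorithm).
        props = resp.mean(axis=0)

    print("true proportions:     ", true_props)
    print("estimated proportions:", np.round(props, 3))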


So there you have it. Through my choice of graduate education, by taking a consulting class, and by drawing the right consulting client, I had entered at an early stage into the arenas of mixture models and the EM algorithm, both of which were to display considerable growth for the next thirty years. I got in on the ground floor, so to speak. I think the message for young people is to be open to new ideas, and be ready to head in surprising directions, even if they are not popular or well known.

In many ways the growth of mixture models and the EM algorithm came from a shift in computing power. As an old-timer, I feel some obligation to offer here a brief side discussion on the history of computing in statistics during the first years of my career. There was a revolution underway, and it was to change completely how we thought about "feasible" ways to do statistics.

As an undergraduate, I had used punch cards for computing in my computer science course. All computing on campus was done through a single "mainframe" computer. First one would "punch" the desired program onto cards, one program line being one card, on a special machine. One would then take this pile of cards to a desk in the computer center for submission. Sometime later one would get the output, usually with some errors that needed fixing. Turnaround to a completed successful program was very, very slow.

Computing was still in the "punch card" era when I went to graduate school at UW in 1974. However, during my years there, the shift to "terminals" and personal computing was starting. It was so much more attractive that the focus on mainframe computing quickly faded out. At that point, efficiency in programming was proving to be much more valuable than speed of the machine.

This efficiency had immediate benefits in statistics. Previously there had been a great emphasis on methods that could be computed explicitly. I happened, by chance, to be an observer at one of the most important events of the new era. I attended a meeting in Seattle in 1979 when Bradley Efron gave one of the first talks on the bootstrap, a subject that blossomed in the 1980s (Efron, 1979). Statistics was waking up to the idea that computing power might drive a revolution in methodology.

The Bayesian revolution was to come a little later. In 1986 I attended an NSF–CBMS workshop by Adrian Smith on Bayesian methods — he gave a lot of emphasis to techniques for numerical integration, but the upper limit was seven dimensions, as I recall (Naylor and Smith, 1982). All this hard work on integration techniques was to be swept away in the 90s by MCMC methods (Smith and Gelfand, 1992). This created a revolution in access to Bayes, although most of the rigor of error bounds was lost. In my mind, the answers are a bit fuzzy, but then so is much of statistics based on asymptotics.

8.4 The PhD

Returning to my graduate education, after two years my exams were taken and passed, and research was about to begin. Like most young people I started by examining projects that interested my possible PhD mentors. The UW Biomathematics program was tremendously rich in opportunities for choosing a thesis direction. It was clear to me that I preferred statistics over the other possible biomathematics areas. In addition, it was clear from my qualifying exam results that I was much better at mathematical statistics than applied statistics. I liked the scientific relevance of applied statistics, but also I felt more of a research curiosity about mathematical statistics.


My first PhD investigation was with Ron Pyke of the UW math department. The subject related to two-dimensional Brownian motion, but it generated little fascination in me — it was too remote from applications. I therefore had to go through the awkward process of "breaking off" with Ron. Ever since then I have always told students wishing to do research with me that they should not be embarrassed about changing advisors to suit their own interests.

The best applied people at UW were in the Biostatistics group, and after talking to several of them, I ended up doing my dissertation with Norm Breslow, who also had a pretty strong theoretical bent and a Stanford degree. My first contact with Norm was not particularly auspicious. Norm taught a course in linear models in about 1975 that had the whole class bewildered. The textbook was by Searle (2012), which was, in that edition, doggedly matrix oriented. However, for the lectures Norm used material from his graduate days at Stanford, which involved new ideas from Charles Stein about "coordinate free" analysis using projections. The mismatch of textbook and lecture was utterly baffling to all the students — go ask anyone in my class. I think we all flunked the first exam. But I dug in, bought and studied the book by Scheffé (1999), which was also notoriously difficult. In the end I liked the subject matter and learned quite a bit about how to learn on my own. I am sure the geometric emphasis I learned there played a later role in my development of geometric methods in likelihood analyses.

Fortunately my second class with Norm, on categorical variables, went rather better, and that was where I learned he was quite involved in epidemiology and cancer studies. I later learned that his father Lester Breslow was also something of a celebrity in science, being Dean of the School of Public Health at UCLA.

Although he was an Associate Professor, my years of military service meant that Norm was only a few years older than me. He had already made a name for himself, although I knew nothing about that. I found him inspiring through his statistical talent and biological knowledge, but mainly through his passion for statistics. I sometimes wonder if there was not some additional attraction because of his youth. The mathematics genealogy website shows me to be his second PhD student. Going up my family tree, Norm Breslow was Brad Efron's first student, and Brad Efron was Rupert Miller's second. Going back yet further, Rupert Miller was fifth of Samuel Karlin's 43 offspring. Going down the tree, Kathryn Roeder, a COPSS award winner, was my first PhD student. This "first-born" phenomenon seems like more than chance. At least in my line of descent, youthful passion and creativity created some sort of mutual attraction between student and advisor.

One of the most important challenges of graduate life is settling on a research topic with the advisor. This will, after all, set the direction for your career, if you go into research. I would like to discuss my graduate experience in some detail here because of the combination of chance and risk taking that a research career entails.


In my experience, most students pick from a small list of suggestions by their chosen advisor. Indeed, that is the way I started my research. Norm had suggested the development of a sequential testing method based on partial likelihood (Cox, 1975). I am sure Norm foresaw applications in clinical trials. He probably had some confidence I could do the hard math involved.

Being a conscientious student, I started off on this problem but I was very soon sidetracked onto related problems. I am not sure if I would recommend this lack of focus to others. However, I do think you have to love what you are doing, as there is no other way to succeed in the long run. I did not love the problem he gave me. It seemed to be technically difficult without being deep, and not really a chance to grow intellectually.

In the end I wrote a thesis on my own topic, about three steps removed from Norm Breslow's proposed topic. I started by studying partial likelihood, which was then a very hot topic. But as I read the papers I asked myself this — what is the justification for using this partial likelihood thing beyond its ease of computation? A better-focused student would have hewn to Norm's original suggestion, but I was already falling off the track. Too many questions in my mind.

I assure you my independence was not derived from high confidence. On the contrary, no matter my age, I have always felt inferior to the best of my peers. But I am also not good at following the lead of others — I guess I like marching to the beat of my own drummer. It does not guarantee external success, but it gives me internal rewards.

One risk with research on a hot topic is that you will be scooped. As it turns out, the justification, on an efficiency basis, for Cox's partial likelihood in the proportional hazards model was on Brad Efron's research plate about that time. So it was a lucky thing for me that I had already moved on to an older and quieter topic.

The reason was that the more I read about the efficiency of likelihood methods, the less I felt like I understood the answers being given. It all started with the classic paper by Neyman and Scott (1948), which demonstrated severe issues with maximum likelihood when there were many nuisance parameters and only a few parameters of interest. I read the papers that followed up on Neyman and Scott, working forward to the current time. I have to say that I found the results to that time rather unsatisfying, except for models in which there was a conditional likelihood that could be used.
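The Neyman–Scott phenomenon is easy to see in a small simulation. The sketch below is a minimal illustration of the standard textbook version of their example (pairs of observations sharing an unknown mean, one nuisance mean per pair): the maximum likelihood estimator of the common variance settles near half its true value, while working with the within-pair differences, which eliminates the nuisance means, repairs the problem. It is offered only as an illustration of the issue the paper raised, not a reconstruction of anything in the original article.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma2 = 20_000, 1.0
mu = rng.normal(size=n)                           # one nuisance mean per pair
x = mu[:, None] + rng.normal(scale=np.sqrt(sigma2), size=(n, 2))

xbar = x.mean(axis=1, keepdims=True)              # MLE of each nuisance mean
sigma2_mle = ((x - xbar) ** 2).sum() / (2 * n)    # joint MLE of the common variance
print(round(sigma2_mle, 3))                       # about 0.5, not 1.0: inconsistent

# Eliminating the nuisance means via within-pair differences fixes it
sigma2_diff = ((x[:, 0] - x[:, 1]) ** 2).sum() / (2 * n)
print(round(sigma2_diff, 3))                      # about 1.0
```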


My early exposure to mixture models provided me with a new way to think about consistency and efficiency in nuisance parameter problems, particularly as it related to the use of conditional and partial likelihoods. I put these models in a semiparametric framework, where the nuisance parameters were themselves drawn from a completely unknown "mixing distribution." In retrospect, it seems that no matter how much I evolved in my interests, I was still drawing strength from that 1975 consulting project.

My research was mostly self-directed because I had wandered away from Norm's proposed topic. I would report to Norm what I was working on, and he would humor me. Kung Yee Liang, who followed me as a Breslow student, once told me that Norm had asked him to read my thesis and explain it to him.

I tell this story not because I advise students to follow this kind of independent course. The fact that Norm barely understood what I was doing and where I was headed was something of a handicap to me. I spent years figuring out the relevant literature. The real problem is that a graduate student just does not know the background, has not seen the talks, and cannot know whether the statistical community will think the research is important. I was taking serious risks, and consider myself fortunate that things worked out in the end.

After settling on a topic and doing the research, another big hurdle must be crossed. I must say that at the beginning I found it very difficult to write up my statistical research, and so I am very sympathetic to my PhD students when they struggle with the organization, the motivation, the background, and reporting the results. I keep telling them that they are simply telling a story, just like they do when they give an oral presentation. If you can give a good talk, you can write a good paper. At Penn State these days many of the graduate students give multiple talks at meetings and in classes. I am sure this must help them immensely when it comes to writing and defending their dissertations, and going on job interviews. I had no such opportunities at Washington, and I am sure it showed in my early writing.

At any rate, Norm returned my thesis drafts with lots of red marks. In the beginning I felt like I had failed, but then bit by bit my writing became clearer and more fitting to the statistical norm. I still had a problem with figuring out the distinction between what was important and what was merely interesting. In the end, I wrote a very long thesis titled "Efficiency in the presence of nuisance parameters." It was a long way from being publishable.

I must say that in those days proper editing was very difficult. I would start the process by turning my handwritten drafts over to a mathematical typist. In those days mathematical results were usually typed on an IBM Selectric typewriter. There were little interchangeable balls that had the various symbols and fonts. Typing in math meant stopping to change the ball, maybe more than once for each equation. This slow typing system very distinctly discouraged editing manuscripts. Mistakes could mean redoing entire pages, and revisions could mean retyping the whole manuscript. Changes in an introduction would alter everything thereafter.

Thank goodness those days also disappeared with the advent of personal computing. Now one can spend time refining a manuscript without retyping it, and I am sure we have all benefited from the chance to polish our work.

At last the finish line was reached: my thesis was submitted and approved in 1978.


8.5 Job and postdoc hunting

I was ready to move on, a bright-eyed 31-year-old. I had enjoyed doing research, and having received positive feedback about it, I was ready to try the academic job market. I submitted applications to many schools. While I would have preferred to stay on the West Coast, the list of job opportunities seemed pretty limiting. With Norm's encouragement, I also applied for an NSF–NATO postdoctoral fellowship.

To my somewhat shocked surprise, I had six job interviews. I can only infer that I must have had some good letters from well known scholars. In the end I had interviews at UC Berkeley, UC Davis, Purdue, Florida State, Princeton, and Penn State. I was quite awestruck about being paid to fly around the country, as at that point in my life I had never flown anywhere. It was rather nice to be treated like a celebrity for a couple of months.

Of course, the interviews could sometimes be intimidating. One colorful character was Herman Rubin of Purdue, who was notorious for cross-examining job candidates in his office. At the end of my seminar, he raised his hand and stated that my results could not possibly be correct. It was a bit disconcerting. Another place that was frightening, mostly by the fame of its scholars, was Berkeley. Peter Bickel, not much older than I, was already Department Head there. Another place with some intellectual firepower was Princeton, where John Tukey sat in the audience.

Wherever I did an interview, I told the school that I would like to take the NATO postdoc if it became available, and that being able to do so would be a factor in my decision. In the end, several did make me offers with an open start date, and after considerable deliberation, and several coin tosses, I accepted Penn State's offer over Princeton's. Since the Princeton department closed soon thereafter, I guess I was right.

8.6 The postdoc years

In the end, I did garner the postdoc. With it, I went to Imperial College in London for 1978–79, where my supervisor was the famous Sir David Cox. His paper (Cox, 1972) was already on its way to being one of the most cited works ever (Ryan and Woodall, 2005). My thanks to Norm for opening this door. All those early career choices, like University of Washington and Breslow, were paying off with new opportunities.

In London I went back to work on my dissertation topic. When I had visited Berkeley, I learned that Peter Bickel had a student working on related problems, and Peter pointed out some disadvantages about my approach to asymptotics. I had drawn heavily on a monograph by Bahadur (1971).


Neither I nor my advisors knew anything about the more attractive approaches coming out of Berkeley. So my first agenda in London was to repair some of my work before submitting it.

It was extremely inspiring to be around David Cox for a year. Although David seemed to fall asleep in every seminar, his grasp of the problems people were working on, as evidenced by his piercing questions and comments, was astounding. He single-handedly ran Biometrika. He was the master of the whole statistical domain.

David was very kind to me, even though he did not have a lot to say to me about my research. I think it was not really his cup of tea. He did, however, make one key link for me. He suggested that I read the paper by Kiefer and Wolfowitz (1956) about a consistent method of estimation for the Neyman–Scott problem. That paper was soon to pull me into the nascent world of nonparametric mixture modelling. An article by Laird (1978) had just appeared, unbeknownst to me, but for the most part the subject had been dead since the Kiefer and Wolfowitz paper.

My postdoc year was great. It had all the freedom of being a grad student, but with more status and knowledge. After a great year in London, I returned to the US and Penn State, ready to start on my tenure track job.

8.7 Starting on the tenure track

Going back to my early career, I confess that I was not sure that I was up to the high-pressure world of publish-or-perish, get tenure or move on. In fact, I was terrified. I worried about my mental health and about my ability to succeed. I am sure that I was not the first or last to feel these uncertainties, and have many times talked in sympathy with people on the tenure track, at Penn State and elsewhere.

It took a number of years for the thesis research to end its stumbling nature and crystallize. I published several papers in The Annals of Statistics that would later be viewed as early work on efficiency in semiparametric models. I also made some contributions to the problem of estimating a mixing distribution nonparametrically by maximum likelihood. In the process I learned a great deal about convex optimization and inverse problems.

I often tell my students that I had plenty of doubts about my success when I was an Assistant Professor. My publication list was too short. The first paper from my 1978 thesis was not written until 1979, and appeared in 1980. It was not even in a statistics journal; it was the Philosophical Transactions of the Royal Society of London (Lindsay, 1980).

My first ten papers were solo-authored, so I was definitely flying on my own. It must have been a kinder era, as I don't recall any rejections in that period.


And that certainly helps with confidence. (You might find it reassuring to know that I have had plenty of rejections since.)

I had six papers on my CV when I came up for tenure in 1984–85. And I had published nothing at all in 1984. I know why: I had spent most of a year trying to build a new EM theory, and then giving up on it. I was a bit scared. Nowadays, and even then, my number would be considered below average. Indeed, many of our recent Assistant Professor candidates at Penn State University seem to have had that many when they arrived at Penn State. Just the same, the people who wrote the external letters for me were very supportive, and I was promoted with tenure.

Back then it was still not obvious to me that statistics was the right place for me. My wife Laura likes to remind me that some time in the 1980s I told a graduate student that "Statistics is dead." I can understand why I said it. The major conceptual and philosophical foundations of statistics, things like likelihood, Bayes, hypothesis testing, multivariate analysis, robustness, and more, had already been developed and investigated. A few generations of ingenious thought had turned statistics into an academic subject in its own right, complete with Departments of Statistics. But that highly energetic creative era seemed to be over. Some things had already fossilized, and the academic game contained many who rejected new or competing points of view. It seemed that the mathematically based research of the 1970s and 1980s had in large part moved on to a refinement of ideas rather than fundamentally new concepts.

But the granting of tenure liberated me from most of these doubts. I now had a seal of approval on my research. With this new confidence, I realized that, in a larger sense, statistics was hardly dead. In retrospect, I should have been celebrating my participation in a subject that, relative to many sciences, was a newborn baby. In particular, the computer and data revolutions were about to create big new and interesting challenges. Indeed, I think statistics is much livelier today than it was in my green age. It is still a good place for discovery, and a subject worthy of passion.

For example, multivariate analysis is now on steroids, probing ever deeper into the mysteries of high-dimensional data analysis, big p and little n, and more. New techniques, new thinking, and new theory are arising hand in hand. Computational challenges that arise from complex models and enormous data abound, and are often demanding new paradigms for inference. This is exciting! I hope my experiences have shed some light on your own passionate pursuits in the new statistics.


References

Bahadur, R.R. (1971). Some Limit Theorems in Statistics. Society for Industrial and Applied Mathematics, Philadelphia, PA.

Breslow, N.E. and Day, N.E. (1980). Statistical Methods in Cancer Research. Vol. 1: The Analysis of Case-Control Studies. Distributed for IARC by the World Health Organization, Geneva, Switzerland.

Cox, D.R. (1972). Regression models and life-tables (with discussion). Journal of the Royal Statistical Society, Series B, 34:187–220.

Cox, D.R. (1975). Partial likelihood. Biometrika, 62:269–276.

Dempster, A.P., Laird, N.M., and Rubin, D.B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39:1–38.

Efron, B. (1977). The efficiency of Cox's likelihood function for censored data. Journal of the American Statistical Association, 72:557–565.

Efron, B. (1979). Bootstrap methods: Another look at the jackknife. The Annals of Statistics, 7:1–26.

Kiefer, J. and Wolfowitz, J. (1956). Consistency of the maximum likelihood estimator in the presence of infinitely many incidental parameters. The Annals of Mathematical Statistics, 27:887–906.

Laird, N.M. (1978). Nonparametric maximum likelihood estimation of a mixing distribution. Journal of the American Statistical Association, 73:805–811.

Lindsay, B.G. (1980). Nuisance parameters, mixture models, and the efficiency of partial likelihood estimators. Philosophical Transactions of the Royal Society of London, Series A, 296:639.

Lindsay, B.G. (1983). The geometry of mixture likelihoods: A general theory. The Annals of Statistics, 11:86–94.

Lindsay, B.G., Clogg, C.C., and Grego, J. (1991). Semiparametric estimation in the Rasch model and related exponential response models, including a simple latent class model for item analysis. Journal of the American Statistical Association, 86:96–107.

Naylor, J.C. and Smith, A.F.M. (1982). Applications of a method for the efficient computation of posterior distributions. Applied Statistics, 31:214–225.


Neyman, J. and Scott, E.L. (1948). Consistent estimates based on partially consistent observations. Econometrica, 16:1–32.

Ott, J. (1977). Counting methods (EM algorithm) in human pedigree analysis: Linkage and segregation analysis. Annals of Human Genetics, 40:443–454.

Roeder, K., Devlin, B., and Lindsay, B.G. (1989). Application of maximum likelihood methods to population genetic data for the estimation of individual fertilities. Biometrics, 45:363–379.

Ryan, T.P. and Woodall, W.H. (2005). The most-cited statistical papers. Journal of Applied Statistics, 32:461–474.

Scheffé, H. (1999). The Analysis of Variance. Wiley, New York.

Searle, S.R. (2012). Linear Models. Wiley, New York.

Smith, A.F.M. and Gelfand, A.E. (1992). Bayesian statistics without tears: A sampling–resampling perspective. The American Statistician, 46:84–88.

Wang, J. (2003). Maximum-likelihood estimation of admixture proportions from genetic data. Genetics, 164:747–765.


9
Reflections on a statistical career and their implications

R. Dennis Cook
School of Statistics
University of Minnesota, Minneapolis, MN

This chapter recounts the events that steered me to a career in statistics and describes how my research and statistical temperament were set by my involvement in various applications. The discussion encompasses the historical and contemporary role of statistical diagnostics in practice and reflections on the importance of applications in the professional life of a statistician.

9.1 Early years

It was mostly serendipity that led me to a career in statistics.

My introduction to statistics started between my Sophomore and Junior years in high school. At the time I was looking for summer employment so I could earn money to customize my car — neat cars elevated your social standing and attracted the girls. I was fortunate to secure summer and eventually after-school employment with the Agronomy Department at Fort Assiniboine, an agriculture experimentation facility located just outside Havre, Montana. The surrounding area is largely devoted to wheat production, thousands and thousands of acres of spring and winter wheat. The overarching goal of the Agronomy Department was to develop contour maps of suggested fertilization regimes for use by wheat farmers along the High Line, a run of about 130 miles between Havre and Cut Bank, Montana, and to occasionally develop targeted recommendations for specific tracts of land at the request of individual farmers.

I continued to work at Fort Assiniboine until I graduated from high school, at which point I enlisted in the military to avoid the uncertainty of the draft. After fulfilling my military obligation, which ended just before the buildup to the Vietnam war, I returned to full-time employment at Fort Assiniboine while pursuing an undergraduate degree at Northern Montana College.


In order to meet the needs of the surrounding community, Northern offered four degree programs — nursing, education, liberal arts and modern farming methods. Given the choices, I decided that education was my best bet, although I was ambivalent about a speciality. While in the military I developed a strong distaste for standing in line, so on the first day of registration when I encountered long lines everywhere except for mathematics education, my choice was clear. I continued to work at Fort Assiniboine for four more years until I completed my undergraduate degree in mathematics education with a minor in biology.

My duties during the seven years of employment at Fort Assiniboine focused on statistics at one level or another, the same cycle being repeated year after year. Starting in the late winter, we would prepare the fertilizer combinations to be tested in the next cycle and lay out the experimental designs on paper. We typically used randomized complete block designs, but had to be prepared with completely randomized and Latin square designs, since we never knew what the experimental location was like before arriving with the planting crew. Split plot and split block designs were also used from time to time. Experimental plots would be planted in the spring (or the late fall in the case of winter wheat), and tended throughout the summer by keeping the alleys between the plots free of weeds. Plots were harvested in the fall, followed by threshing and weighing the wheat. Most of the winter was spent constructing analysis of variance tables with the aid of large desktop Monroe calculators and drawing conclusions prior to the next cycle of experimentation.

During my first year or so at Fort Assiniboine, I functioned mostly as a general laborer, but by the time I finished high school I had developed an appreciation for the research. I was fortunate that, from the beginning, the Department Head, who had a Master's degree in agronomy with a minor in statistics from a Canadian university, encouraged me to set aside time at work to read about experimental design and statistical methods. This involved studying Snedecor's text on Statistical Methods and Fisher's monograph on The Design of Experiments, in addition to other references. The material came quickly for me, mostly because nearly all that I read corresponded to something we were actually doing. But I recall being a bit baffled by the need to select a significance level and the role of p-values in determining recommendations. The possibility of developing a formal cost function to aid our recommendations did not arise until graduate school some years later. My undergraduate education certainly helped with the mathematics, but was little help with statistics since the only directly relevant offering was a course in probability with a cursory treatment of introductory statistics.

I was eventually given responsibility for nearly all aspects of the experimental trials at the Fort. I had learned to assess the experimental location, looking for hollows and moisture gradients, and to select and arrange an appropriate design. I learned that mis-entering a number could have costly consequences. A yield of 39 bushels mis-entered as 93 bushels per acre could make a nonsignificant factor seem highly significant, resulting in an unjustified costly recommendation to the wheat farmers.


It is for this reason that I instituted "parallel computing." Two of us would sit at adjacent desks and simultaneously construct analysis of variance tables, checking that our results matched at each step of the analysis. A mismatch meant that we had to repeat the calculation in full, since there was no way of recovering what we had entered.

We would occasionally lose an experiment at a remote location because the grain was too wet to harvest during the window of opportunity. In an effort to recover some information, I came up with the idea of estimating the number of seed heads per foot of row. That operation required only counting, and the moisture content of the grain was irrelevant. A few pilot experiments showed that the count was usefully correlated with the grain weight, so we were able to gain some information from experiments that would be otherwise lost.

During my final year at Northern Montana College, I was required to spend six months student teaching at the local high school, where I taught junior algebra and sophomore biology. I found actual teaching quite rewarding, but my overall experience was a disappointment because colorless non-teaching duties dominated my days. The Department Head at the Fort had been encouraging me to pursue a graduate degree in statistics or perhaps mathematics and, after my experience student teaching, I decided to follow his advice. I applied to four universities, two in mathematics and two in statistics, that did not require a fee to process my application because finances were extremely tight. My decision rule was to accept the first that offered a fellowship or assistantship. The following fall I began my graduate studies in statistics at Kansas State University, aided by a traineeship from the National Institutes of Health and subsequently a fellowship under the National Defense Education Act, which was enacted in response to the Soviet Union's successful launch of Sputnik and President Kennedy's moon initiative.

Although my degree from Kansas State was in statistics, my dissertation was in genetics; it was entitled "The Dynamics of Finite Populations: The Effects of Variable Selection Intensity and Population Size on the Expected Time to Fixation and the Ultimate Probability of Fixation of an Allele." I enjoyed seeing genetic theory in action and many hours were spent during my graduate career conducting laboratory experiments with Drosophila melanogaster. My first paper was on Bayes' estimators of gene frequencies in natural populations. My background and fellowships enabled me to complete my PhD degree in three years, at which point I joined the then nascent School of Statistics at the University of Minnesota, with an appointment consisting of intramural consulting, teaching and research, in roughly equal proportions. I continued my genetics research for about four years until I had achieved tenure and then began a transition to statistical research, which was largely stimulated and guided by my consulting experiences.


9.2 Statistical diagnostics

In his path-breaking 1922 paper "On the mathematical foundations of theoretical statistics," R.A. Fisher established the contemporary role of a statistical model and anticipated the development of diagnostic methods for model assessment and improvement. Diagnostics was a particularly active research area from the time of Fisher's death in 1962 until the late 1980s, and the area is now an essential ingredient in Fisher modeling.

9.2.1 Influence diagnostics

My involvement with diagnostics began early in my career at Minnesota. A colleague from the Animal Science Department asked me to review a regression because his experiment had apparently produced results that were diametrically opposed to his prior expectation. The experiment consisted of injecting a number of rats with varying doses of a drug and then measuring the fraction of the doses, which were the responses, that were absorbed by the rats' livers. The predictors were various measurements on the rats plus the actual dose. I redid his calculations, looked at residual plots and performed a few other checks that were standard for the time. This confirmed his results, leading to the possibilities that either there was something wrong with the experiment, which he denied, or his prior expectations were off. All in all, this was not a happy outcome for either of us.

I subsequently decided to use a subset of the data for illustration in a regression course that I was teaching at the time. Astonishingly, the selected subset of the data produced results that clearly supported my colleague's prior expectation and were opposed to those from the full data. This caused some anxiety over the possibility that I had made an error somewhere, but after considerable additional analysis I discovered that the whole issue centered on one rat. If the rat was excluded, my colleague's prior expectations were sustained; if the rat was included, his expectations were contradicted. The measurements on this discordant rat were accurate as far as anyone knew, so the ball was back in my now quite perplexed colleague's court.

The anxiety that I felt during my exploration of the rat data abated but did not disappear completely because of the possibility that similar situations had gone unnoticed in other regressions. There were no methods at the time that would have identified the impact of the one unusual rat; for example, it was not an outlier as judged by the standard techniques. I decided that I needed a systematic way of finding such influential observations if they were to occur in future regressions, and I subsequently developed a method that easily identified the irreconcilable rat. My colleagues at Minnesota encouraged me to submit my findings for publication (Cook, 1977), which quickly took on a life of their own, eventually becoming known as Cook's Distance, although no one sought my acquiescence.
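For readers who want to see the diagnostic in action, here is a minimal numpy sketch of the distance as it is usually presented in textbooks, applied to simulated data with one artificially perturbed case. It is only an illustration of the idea, not a reconstruction of the original rat analysis, and the data and model are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 30
x = rng.normal(size=n)
y = 2 + 0.5 * x + rng.normal(scale=0.3, size=n)
y[0] += 5.0                                   # one discordant case, in the spirit of the rat

X = np.column_stack([np.ones(n), x])          # design matrix with intercept
beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # ordinary least squares fit
resid = y - X @ beta
p = X.shape[1]                                # number of regression parameters
s2 = resid @ resid / (n - p)                  # residual mean square
h = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T) # leverages (diagonal of the hat matrix)

cooks_d = resid**2 / (p * s2) * h / (1 - h)**2
print(np.argmax(cooks_d), round(cooks_d.max(), 2))  # the perturbed case stands out
```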


In 1982 I coauthored a fairly comprehensive research monograph on the state of diagnostic methods (Cook and Weisberg, 1982).

Encouraged by the wide acceptance of Cook's Distance and my other diagnostic contributions, and aided by a year-long fellowship from the Mathematics Research Center at the University of Wisconsin, I continued working in diagnostics with the goal of developing local differential geometric measures that might detect various influential characteristics of a generic likelihood-based analysis. In 1986 I read before the Royal Statistical Society a paper on a local likelihood-based technique for the development of diagnostics to detect influential aspects of an analysis (Cook, 1986).

Today models can be and often are much more complicated than those likely entertained by Fisher or in common use around the time that I was earnestly working on influence diagnostics. As a consequence, the methods developed prior to the 1990s are generally not applicable in more complicated contemporary contexts, and yet these contexts are no less affected by influential observations. Intricate models are prone to instability and the lack of proper influence diagnostics can leave a cloud of doubt about the strength of an analysis. While influence diagnostics have been keeping pace with model development largely through a series of important papers by Hongtu Zhu and his colleagues (Zhu et al., 2007, 2012), methods to address other diagnostic issues, or issues unique to a particular modeling environment, are still lagging far behind. Personally, I am reluctant to accept findings that are not accompanied by some understanding of how the data and model interacted to produce them.

9.2.2 Diagnostics more generally

A substantial battery of diagnostic methods for regression was developed during the 1970s and 1980s, including transformation diagnostics, various graphical diagnostics like residual plots, added variable plots (Cook and Weisberg, 1982), partial residual plots and CERES plots for predictor transformations (Cook, 1993), methods for detecting outliers and influential observations, and diagnostics for heteroscedasticity (Cook and Weisberg, 1983). However, it was unclear how these methods should be combined in a systematic way to aid an analysis, particularly since many of them addressed one issue at a time. For instance, diagnostics for heteroscedasticity required that the mean function be correct, regardless of the fact that an incorrect mean function and homoscedastic errors can manifest as heteroscedasticity. Box's paradigm (Box, 1980) for model criticism was the most successful of the attempts to bring order to the application of diagnostic methods and was rapidly adopted by many in the field. It consists essentially of iteratively improving a model based on diagnostics: an initial model is posited and fitted to the data, followed by applications of a battery of diagnostic methods.


The model is then modified to correct the most serious deficiencies detected, if any. This process is then iterated until the model and data pass the selected diagnostic checks.

Diagnostic methods guided by Box's paradigm can be quite effective when the number of predictors is small by today's standards, say less than 20, but in the early 1990s I began encountering many regressions that had too many predictors to be addressed comfortably in this way. I once spent several days analyzing a data set with 80 predictors (Cook, 1998, p. 296). Box's paradigm was quite useful and I was pleased with the end result, but the whole process was torturous and not something I would look forward to doing again. A different diagnostic paradigm was clearly needed to deal with regressions involving a relatively large number of predictors.

9.2.3 Sufficient dimension reduction

Stimulated by John Tukey's early work on computer graphics and the revolution in desktop computing, many dynamic graphical techniques were developed in the late 1980s and 1990s, including linking, brushing, scatterplot matrices, three-dimensional rotation and its extensions to grand tours, interactive smoothing and plotting with parallel coordinates. My first exposure to dynamic graphics came through David Andrews' Macintosh program called McCloud. At one point I thought that these tools might be used effectively in the context of diagnostics for regressions with a relatively large number of predictors, but that proved not to be so. While dynamic graphical techniques allow many plots to be viewed in a relatively short time, most low-dimensional projective views of data can be interesting and ponderable, but at the same time do not necessarily provide useful information about the higher dimensional data. In regression, for example, two-dimensional plots of the response against various one-dimensional projections of the predictors can be interesting as individual univariate regressions but do not necessarily provide useful information about the overarching multiple regression employing all predictors simultaneously. Many projective views of a regression seen in a short time can quickly become imponderable, leaving the viewer with an array of disconnected facts about marginal regressions but little substantive knowledge about the full regression.

My foray into dynamic computer graphics was methodologically unproductive, but it did stimulate a modest epiphany in the context of regression that is reflected by the following question: Might it be possible to construct a low-dimensional projective view of the data that contains all or nearly all of the relevant regression information without the need to pre-specify a parametric model? If such a view could be constructed then we may no longer need to inspect many diagnostic plots and Box's paradigm could be replaced with a much simpler one, requiring perhaps only a single low-dimensional display as a guide to the regression. Stated more formally, can we find a low-dimensional subspace S of the predictor space with the property that the response Y is independent of the predictor vector X given the projection P_S X of X onto S; that is, Y ⊥ X | P_S X?


Subspaces with this property are called dimension reduction subspaces. The smallest dimension reduction subspace, defined as the intersection of all dimension reduction subspaces when it is itself a dimension reduction subspace, is called the central subspace S_{Y|X} (Cook, 1994, 1998). The name "central subspace" was coined by a student during an advanced topics course in regression that I was teaching in the early 1990s. This area is now widely known as sufficient dimension reduction (SDR) because of the similarity between the driving condition Y ⊥ X | P_{S_{Y|X}} X and Fisher's fundamental notion of sufficiency. The name also serves to distinguish it from other approaches to dimension reduction.

The central subspace turned out to be a very effective construct, and over the past 20 years much work has been devoted to methods for estimating it, the first two methods being sliced inverse regression (Li, 1991) and sliced average variance estimation (Cook and Weisberg, 1991). These methods, like nearly all of the subsequent methods, require the so-called linearity and constant covariance conditions on the marginal distribution of the predictors. Although these conditions are largely seen as mild, they are essentially uncheckable and thus a constant nag in application. Ma and Zhu (2012) recently took a substantial step forward by developing a semi-parametric approach that allows modifications of previous methods so they no longer depend on these conditions. The fundamental restriction to linear reduction P_{S_{Y|X}} X has also been long recognized as a limitation. Lee et al. (2013) recently extended the foundations of sufficient dimension reduction to allow for non-linear reduction. This breakthrough, like that from Ma and Zhu, opens a new frontier in dimension reduction that promises further significant advances. Although SDR methods were originally developed as comprehensive graphical diagnostics, they are now serviceable outside of that context.
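As a rough illustration of how the first of these estimation methods operates, the sketch below implements a bare-bones version of sliced inverse regression: standardize the predictors, average them within slices of the ordered response, and take the leading eigenvectors of the covariance of those slice means. It assumes the linearity condition mentioned above, uses invented toy data, and is meant only as a toy rendering of the published method, not a faithful reproduction of any particular implementation.

```python
import numpy as np

def sir_directions(X, y, n_slices=10, n_dirs=1):
    """Minimal sliced inverse regression: estimate directions spanning
    (part of) the central subspace from slice means of standardized X."""
    n, p = X.shape
    Sigma = np.cov(X, rowvar=False)
    # symmetric inverse square root of the predictor covariance
    vals, vecs = np.linalg.eigh(Sigma)
    Sigma_inv_sqrt = vecs @ np.diag(vals**-0.5) @ vecs.T
    Z = (X - X.mean(axis=0)) @ Sigma_inv_sqrt
    # slice on the ordered response and average Z within each slice
    order = np.argsort(y)
    M = np.zeros((p, p))
    for idx in np.array_split(order, n_slices):
        m = Z[idx].mean(axis=0)
        M += (len(idx) / n) * np.outer(m, m)
    # leading eigenvectors of M, mapped back to the original predictor scale
    _, evecs = np.linalg.eigh(M)
    eta = evecs[:, ::-1][:, :n_dirs]
    return Sigma_inv_sqrt @ eta

# toy check: Y depends on X only through its first coordinate
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5))
y = X[:, 0] + 0.5 * X[:, 0] ** 3 + 0.2 * rng.normal(size=500)
print(sir_directions(X, y).ravel())   # dominated by the first coordinate, up to sign and scale
```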


Technological advances resulted in an abundance of applied regressions that Box's paradigm could no longer handle effectively, and SDR methods were developed in response to this limitation. But technology does not stand still. While SDR methods can effectively replace Box's paradigm in regressions with many predictors, they seem ill suited for high-dimensional regressions with many tens or hundreds of predictors. Such high-dimensional regressions were not imagined during the rise of diagnostic or SDR methods, but are prevalent today. We have reached the point where another diagnostic template is needed.

9.2.4 High-dimensional regressions

High-dimensional regressions often involve issues that were not common in the past. For instance, they may come with a sample size n that is smaller than the number of predictors p, leading to the so-called "n < p" problem.

One favored framework imposes a sparsity condition — only a few of the many predictors are relevant for the regression — which reduces the regression goal to finding the relevant predictors. This is now typically done by assuming a model that is (generalized) linear in the predictors and then estimating the relevant predictors by optimizing a penalized objective function. An analysis of a high-dimensional regression based on this approach involves two acts of faith.

The first act of faith is that the regression is truly sparse. While there are contexts where sparsity is a driving concept, some seem to view sparsity as akin to a natural law. If you are faced with a high-dimensional regression then naturally it must be sparse. Others have seen sparsity as the only recourse. In the logic of Bartlett et al. (2004), the bet-on-sparsity principle arose because, to continue the metaphor, there is otherwise little chance of a reasonable payoff. In contrast, it now seems that reasonable payoffs can be obtained also in abundant regressions where many predictors contribute useful information on the response, and prediction is the ultimate goal (Cook et al., 2012).

The second and perhaps more critical act of faith involves believing the data and initial model are flawless, apart from the statistical variation that is handled through the objective function. In particular, there are no outliers or influential observations, any curvature in the mean function is captured adequately by the terms in the model, interactions are largely absent, the response and predictors are in compatible scales and the errors have constant variation. It has long been recognized that regressions infrequently originate in such an Elysian condition, leading directly to the pursuit of diagnostic methods. I can think of no compelling reason these types of considerations are less relevant in high-dimensional regressions. Diagnostic methods can and perhaps should be used after elimination of the predictors that are estimated to be unrelated with the response, but this step alone may be inadequate. Failings of the types listed here will likely have their greatest impact during rather than after penalized fitting. For instance, penalized fitting will likely set the coefficient β of a standard normal predictor X to zero when the mean function in fact depends on X only through a quadratic term βX². Findings that are not accompanied by an understanding of how the data and model interacted to produce them should ordinarily be accompanied by a good dose of skepticism.
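The quadratic example is easy to demonstrate by simulation. The sketch below, which uses scikit-learn's lasso merely as a stand-in for a generic penalized linear fit, generates a response that depends on a standard normal predictor only through its square; the penalized fit sets that predictor's coefficient to essentially zero even though the predictor drives the mean function. The data and tuning value are invented for the illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 5))                 # five standard normal predictors
y = X[:, 0] ** 2 + 0.5 * rng.normal(size=n)  # mean function is purely quadratic in X1

fit = Lasso(alpha=0.1).fit(X, y)
print(np.round(fit.coef_, 3))               # the X1 coefficient is essentially zero
print(round(np.corrcoef(X[:, 0] ** 2, y)[0, 1], 2))  # yet X1 squared is strongly related to y
```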


9.3 Optimal experimental design

My interest in an optimal approach to experimental design arose when designing a comprehensive trial to compare poultry diets at six universities. Although the experimental diets came from a common source, the universities had different capabilities and facilities, which made classical Box–Fisher–Yates variance reduction designs difficult to employ, particularly since the underlying non-linear model called for an unbalanced treatment design.

Optimal experimental design was for many years regarded as primarily a mathematical subject. While applications were encountered from time to time, it was seen as largely a sidelight. Few would have acknowledged optimal design as having a secure place in statistical practice because the approach was too dependent on knowledge of the model and because computing was often an impediment to all but the most straightforward applications. During the 1970s and most of the 1980s, I was occasionally a party to vigorous debates on the relative merits of classical design versus optimal design, pejoratively referred to by some as "alphabetic design" in reference to the rather unimaginative design designations like D-, A- and G-optimality. Today classical and optimal design are no longer typically seen as distinct approaches and the debate has largely abated. The beginning of this coalescence can be traced back to technological advances in computing and to the rise of unbalanced experimental settings that were not amenable to classical design (Cook and Nachtsheim, 1980, 1989).

9.4 Enjoying statistical practice

Statistics has its tiresome aspects, to be sure, but for me the practice of statistics has also been the source of considerable pleasure and satisfaction, and from time to time it was even thrilling.

For several years I was deeply involved with the development of aerial survey methods. This included survey methods for snow geese on their molting grounds near Arviat on the west shore of Hudson Bay, moose in northern Minnesota, deer in southern Manitoba and wild horses near Reno, Nevada. It became apparent early in my involvement with these studies that the development of good survey methods required that I be actively involved in the surveys themselves. This often involved weeks in the field observing and participating in the surveys and making modifications on the fly.

The moose and deer surveys were conducted in the winter when foliage was largely absent and the animals stood out against a snowy background. Nevertheless, it soon became clear from my experience that aerial observers would inevitably miss some animals, leading to underestimation of the population size. This visibility bias would be a constant source of uncertainty unless a statistical method could be developed to adjust the counts. I developed different adjustment methods for moose and deer. Moose occur in herds, and it seemed reasonable to postulate that the probability of seeing an animal is a function of the size of its herd, with solitary animals being missed the most frequently. Adding a stable distribution for herd size then led to an adjustment method that resulted in estimates of population size that were in qualitative agreement with estimates from other sources (Cook and Martin, 1974).


A different adjustment method for deer censuses was developed based on a design protocol that involved having two observers on the same side of the aircraft. The primary observer in the front seat called out and recorded all the deer that he saw. The secondary observer in the rear seat recorded only deer that the primary observer missed. The resulting data plus a few reasonable assumptions on the generation process led directly to adjusted population counts (Cook and Jacobson, 1979).
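The published adjustment rests on a more careful model (Cook and Jacobson, 1979), but a toy version conveys the flavor of how two observers yield an adjusted count. The sketch below is not the Cook–Jacobson estimator; it simply assumes that both observers independently detect each deer with the same unknown probability, so that the secondary observer's catches among the animals missed by the primary estimate the miss rate, and the primary count can be scaled up accordingly. The numbers are invented.

```python
def double_observer_estimate(n_primary, n_secondary_only):
    """Toy two-observer adjustment under an equal, independent detection
    probability p for both observers: n_secondary_only / n_primary
    estimates the miss rate 1 - p, and the primary count is scaled up."""
    p_hat = 1 - n_secondary_only / n_primary
    return n_primary / p_hat          # adjusted population count

# e.g. the primary observer records 120 deer and the secondary adds 30 missed ones
print(double_observer_estimate(120, 30))   # 160 deer estimated present
```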


A version of mark-recapture was developed for estimating population sizes of wild horses. The horses were marked by a tethered shooter leaning out the right side of a helicopter flying near tree-top level above the then running animals. The shooter's demanding task was to use a fancy paint-ball gun to mark the animal on its left rear quarter. I was the primary shooter during the development phase, and I still recall the thrill when the helicopter pulled up sharply to avoid trees or other obstacles.

9.5 A lesson learned

Beginning in my early days at Fort Assiniboine, my statistical perspectives and research have been driven by applications. My work in diagnostic methods originated with a single rat, and my attitude toward inference and diagnostics was molded by the persistent finding that plausible initial models often do not hold up when contrasted against the data. The development of SDR methods was stimulated by the inability of the then standard diagnostic methods to deal effectively with problems involving many variables. And, as mentioned previously, we are now at a point where a new diagnostic paradigm is needed to deal with the high-dimensional regressions of today. My interest in optimal design arose because of the relative rigidity of classical design. My contributions to aerial surveys would have been impossible without imbedding myself in the science. This has taught me a lesson that may seem retrospectively obvious but was not so for me prospectively.

Statistics is driven by applications, which are propelled by technological advances, new data types and new experimental constructs. Statistical theory and methods must evolve and adapt in response to technological innovations that give rise to new data-analytic issues. High-dimensional data, which seems to dominate the pages of contemporary statistics journals, may now be overshadowed by "Big Data," a tag indicating a data collection so large that it cannot be processed and analyzed with contemporary computational and statistical methods. Young statisticians who are eager to leave a mark may often find themselves behind the curve when too far removed from application. The greatest statistical advances often come early in the growth of a new area, to be followed by a fleshing out of its nooks and crannies. Immersing oneself in an application can bring a type of satisfaction that may not otherwise be possible.

References

Bartlett, P.L., Bickel, P.J., Bühlmann, P., Freund, Y., Friedman, J., Hastie, T., Jiang, W., Jordan, M.J., Koltchinskii, V., Lugosi, G., McAuliffe, J.D., Ritov, Y., Rosset, S., Schapire, R.E., Tibshirani, R.J., Vayatis, N., Yu, B., Zhang, T., and Zhu, J. (2004). Discussions of boosting papers. The Annals of Statistics, 32:85–134.

Box, G.E.P. (1980). Sampling and Bayes' inference in scientific modelling and robustness (with discussion). Journal of the Royal Statistical Society, Series A, 143:383–430.

Cook, R.D. (1977). Detection of influential observation in linear regression. Technometrics, 19:15–18.

Cook, R.D. (1986). Assessment of local influence (with discussion). Journal of the Royal Statistical Society, Series B, 48:133–169.

Cook, R.D. (1993). Exploring partial residual plots. Technometrics, 35:351–362.

Cook, R.D. (1994). Using dimension-reduction subspaces to identify important inputs in models of physical systems. In Proceedings of the Section on Physical and Engineering Sciences, American Statistical Association, Washington, DC, pp. 18–25.

Cook, R.D. (1998). Regression Graphics. Wiley, New York.

Cook, R.D., Forzani, L., and Rothman, A.J. (2012). Estimating sufficient reductions of the predictors in abundant high-dimensional regressions. The Annals of Statistics, 40:353–384.

Cook, R.D. and Jacobson, J.O. (1979). A design for estimating visibility bias in aerial surveys. Biometrics, 34:735–742.

Cook, R.D. and Martin, F. (1974). A model for quadrant sampling with "visibility bias." Journal of the American Statistical Association, 69:345–349.

Cook, R.D. and Nachtsheim, C.J. (1980). A comparison of algorithms for constructing exact D-optimal designs. Technometrics, 22:315–324.

Cook, R.D. and Nachtsheim, C.J. (1989). Computer-aided blocking of factorial and response surface designs. Technometrics, 31:339–346.


Cook, R.D. and Weisberg, S. (1982). Residuals and Influence in Regression. Chapman & Hall, London.

Cook, R.D. and Weisberg, S. (1983). Diagnostics for heteroscedasticity in regression. Biometrika, 70:1–10.

Cook, R.D. and Weisberg, S. (1991). Comment on "Sliced inverse regression for dimension reduction" (with discussion). Journal of the American Statistical Association, 86:316–342.

Lee, K.-Y., Li, B., and Chiaromonte, F. (2013). A general theory for nonlinear sufficient dimension reduction: Formulation and estimation. The Annals of Statistics, 41:221–249.

Li, K.-C. (1991). Sliced inverse regression for dimension reduction (with discussion). Journal of the American Statistical Association, 86:328–332.

Ma, Y. and Zhu, L. (2012). A semiparametric approach to dimension reduction. Journal of the American Statistical Association, 107:168–179.

Zhu, H., Ibrahim, J.G., and Cho, H. (2012). Perturbation and scaled Cook's distance. The Annals of Statistics, 40:785–811.

Zhu, H., Ibrahim, J.G., Lee, S., and Zhang, H. (2007). Perturbation selection and influence measures in local influence analysis. The Annals of Statistics, 35:2565–2588.


10
Science mixes it up with statistics

Kathryn Roeder
Department of Statistics
Carnegie Mellon University, Pittsburgh, PA

I have many people to thank for encouraging me to write this essay. Indeed I believe I am among the very last group of stragglers to complete the task. My biggest problem was deciding who was the likely audience. My wonderful thesis advisor, Bruce Lindsay, also an author in this volume, told me to pick my own audience. So, while I welcome any reader, I hope that the story of my collaborative work might provide some insights for young researchers.

10.1 Introduction

An early inspiration for my career was a movie shown in an introductory biology class, "The Story of Louis Pasteur." Paul Muni won an Academy Award playing Pasteur, the renowned scientist who revolutionized microbiology. Filmed in 1936, the movie was dark, creepy, and melodramatic. Some people might have taken inspiration from the contributions Pasteur made to mankind, but what struck me was that he was portrayed as a real person — vain, egotistical, and driven by his ideas. It resonated with me and gave me a glimpse of a future that I could not have imagined when I was growing up on a farm in rural Kansas. It provided a clue that the crazy intensity I felt could be put to good use. My next realization was that while I felt driven, I was not a great scientist. After working for some years as a research assistant, it was apparent that the life of a mediocre scientist would be dreary indeed; however, I liked the mathematical and statistical stuff the other science majors found dull. And so an academic statistician was born.


10.2 Collaborators

Good collaborators make all the difference, both in terms of happiness and productivity. But how does one find them? Good collaborators, like good friends, cannot be found by direct search. They appear when you are pursuing mutual interests. I've been lucky to collaborate with many great statistical colleagues, post docs and graduate students. Here I'll focus on scientific collaborators, because such bonds are not likely to happen by chance. To gain entrance into good scientific collaborations requires a substantial investment of time and effort. In some cases, even when you are an established researcher you have to work for several years as part of the research team before you get regular access to the leading scientist on a project. I have spent many an hour talking with brilliant graduate students and post docs working for leading researchers. This is a great way to participate in big science projects. I find that if I provide them with statistical insights and guidance, they will help me with the data. Sharing expertise and effort in this way is productive and fun.

A successful applied project requires good data, and typically a statistician cannot produce data on her own. Invariably, scientists have invested years of their lives to procure the data to which we want to gain access. Hence, it is traditional for the lab director to be last author, a position of honor, on any papers involving the initial publication of these data. In addition, typically a post doc or graduate student who has also played a substantial role in getting the data is the first author of such papers. Because these data are presented to us electronically, it is easy to forget the tremendous investment others have made. We too want to have a leading role for the statistics team, and the authorship rules can be frustrating. But I have found that it is immensely worthwhile to participate in such projects and contribute where possible. Having made such an investment, it is usually possible to do more involved statistical analysis in a follow-up paper. Naturally, this is where the statisticians get key authorship roles.

Collaboration requires a tremendous amount of sharing and trust, so it is not surprising that it can be challenging to succeed. Just as good collaborations buoy our spirits, bad collaborations wear us down. My mother never went to college, but she was fond of the maxims of economics: "Time is money" and "Don't throw good money after bad" were her favorites. Both of these shed light on the dilemma of what to do about a bad collaboration. None of us likes to invest a lot of effort and get nothing in return, and yet, an unsatisfying or unhappy collaboration does not ultimately lead to good research. I have had many experiences where I've walked away from a project after making substantial investments of effort. I have never regretted getting out of such projects. This leaves more time for other great collaborations.


10.3 Some collaborative projects

I started my search for collaborators as a graduate student. Because I had studied so much biology and chemistry as an undergraduate, it struck me that all those years of training should not be ignored. I decided to hang out with the evolutionary biology graduate students, attending their seminars and social events. In time I found an opportunity to collaborate on a project. The research involved plant paternity. Little did I know that this would be the first joint paper in a long collaborative venture with Bernie Devlin. Years later we married and to date we have co-published 76 papers. Since the early years Bernie's interests have evolved to human genetics and statistical genetics, dovetailing very nicely with my own. So while I can't recommend everyone marry a collaborator, it has benefitted me immensely.

Genetic diversity has been a theme for much of our joint research. To provide an example of how an initial investment in a research topic can lead from one paper to another, I will explain this line of research. But first, to promote understanding of this section, I will provide a very brief primer on genetics. An allele is an alternative form of DNA, located at a specific position on a chromosome. The allele frequency is the probability distribution of the alleles among individuals in the population. Frequently this distribution is estimated using a sample of alleles drawn from the reference database. For example, sickle cell anemia is due to a single base pair change (A to a T) in the beta-globin gene. The particular allelic form with a T is extremely rare in Caucasian populations, but more common in African populations. The reason for this difference in allele frequencies is that the T form provides some benefit in resisting malaria. Thus the selective advantage of the T allele would be felt more strongly in African populations, causing a shift in frequency distribution. Finally, to complete the genetics primer, at each location a pair of alleles is inherited, one from each parent (except, of course, on chromosome X). By Mendel's law, a parent passes on half of their genetic material to an offspring. It is through these simple inheritance rules that numerous genetic relationships can be inferred (such as paternity).

In our first project we needed to infer the paternal source of inheritance for all the seeds produced by a plant. The maternal source is obvious, because the seeds are produced on the maternal plant. While plants don't pay alimony, paternity is interesting for other reasons. Plants are obviously stationary, but the paternal genes are transmitted via pollen by natural vectors (butterflies and such), so genetic material moves much more widely than expected. It is important to know how far genes move naturally so that we can predict the consequences of genetic engineering. From a statistical point of view, for plants, paternity is inferred just as in humans. When a child matches the alleged father at half her alleles, then he is not excluded for paternity. And if the chance of matching these alleles by chance is very low, then paternity is inferred.


My work with plant paternity was about sex, but it could not be considered sexy. I enjoyed this project very much, but it was of interest to a specialized community. It was my next project that attracted the interest of other statisticians. In the late 1980s and early 1990s, DNA forensic inference was in its infancy. One of the earliest uses of this technique occurred in England. Based on circumstantial evidence a boy was accused of the rape and murder of two girls in his village. During his interrogation he asked for a blood test. As luck would have it, Alec Jeffreys, who had just developed a method of DNA fingerprinting, was located just six miles from the village. The boy's blood was tested and he was found to be innocent. This was done by comparing the observed alleles at several genetic locations between the DNA left at the crime scene and the DNA of the suspect. The choice of genetic locations was made so that there were a large number of possible alleles at each location. Consequently the probability of two people matching by chance was extremely low.
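The arithmetic behind "extremely low" is simple multiplication of genotype frequencies. The sketch below uses invented allele frequencies and assumes Hardy–Weinberg proportions and independence across loci; as the rest of this section explains, whether those independence assumptions hold across subpopulations was exactly what the ensuing controversy was about, and real forensic calculations include corrections that this toy omits.

```python
# Hypothetical allele frequencies and a matching profile at three loci (illustrative only).
profile = [
    ({"9": 0.12, "12": 0.07}, ("9", "12")),   # heterozygous locus: 2*p*q
    ({"7": 0.20},             ("7", "7")),    # homozygous locus: p**2
    ({"5": 0.15, "8": 0.10},  ("5", "8")),
]

def match_probability(profile):
    """Random-match probability under Hardy-Weinberg equilibrium and
    independence across loci, multiplied over the typed loci."""
    prob = 1.0
    for freqs, (a1, a2) in profile:
        prob *= freqs[a1] ** 2 if a1 == a2 else 2 * freqs[a1] * freqs[a2]
    return prob

print(match_probability(profile))   # about 2e-05 for these made-up frequencies
```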


There were many aspects to this controversy. Here I'll discuss one issue: the suitability of available reference databases for calculating the probability of a match between a suspect and the perpetrator. The question at issue was how much allele frequencies vary across populations. For instance, if an Asian reference sample is available, will it be suitable if the suspect is Korean? Naturally, controversy rages most keenly when there are few data available from which to answer the question definitively. Let's examine this one guided by Sewall Wright. If we divide the world into populations (continental groups) and subpopulations (ethnic groups within a continent), then we can begin to examine the question. We can partition the variance into various levels: among populations, among subpopulations within a population, among individuals within a subpopulation, and, layered over all of this, sampling error. Moderate-sized samples were immediately available at the population level, and it was apparent that populations did not differ strongly. But at the subpopulation level there was considerable sampling error, making it impossible to determine empirically whether subpopulations varied strongly. Finally, we could clearly see that individuals varied tremendously, at least when we looked at individuals at the population level. The question remained: could individuals in subpopulations be quite similar and hence be falsely accused? I don't want to rehash the controversy, but suffice it to say that it was a profitable debate and it was great for statisticians to be involved on both sides of the argument. By the end of the decade DNA forensics had become a model for solid forensic evidence. Not to mention that, in the meantime, I earned tenure.

The impact of variability in allele frequencies also arises when looking for associations between genetic variants and disease status. Assume two populations vary in allele frequency and also differ in disease prevalence. If the data are pooled over populations, there will be an association between the locus and disease that falsely implies that one allelic variant increases the risk of disease. Indeed, this is Simpson's paradox. In most applications, confounding is a frustration about which little can be done beyond taking care not to draw causal inferences from association studies. In the genetic context, however, there is a fascinating option that allows us to get a little closer to making causal inferences. It all goes back to population genetics, where the variability in allele frequencies is called population substructure.

Suppose we plan to test for association at many genetic locations (SNPs) across the genome using a simple χ² test for association at each SNP. Relying on some statistical and population genetic models, we can show that in the presence of population substructure the test statistic approximately follows an inflated χ² distribution. And, under certain reasonable assumptions, the inflation factor is approximately constant across the genome. Consequently we can estimate this quantity and determine the severity of the confounding. The approach we developed on this principle is called Genomic Control (GC). Applying GC is such a routine part of most tests of genetic association that the original paper is no longer cited.

Recently, large platforms of SNPs became available as part of genome-wide association studies, or GWAS. From this immense source of genetic information, a very good proxy was discovered that creates an approximate map of genetic ancestry. Using this map as a covariate, one can essentially remove the confounding effect of population substructure. The map is generated using dimension-reduction techniques such as principal components or spectral analysis. It is remarkable that the subtle differences in allele frequency at individual SNPs are small enough for forensic inference to remain sufficiently accurate, and yet, accumulated over a huge number of alleles, their combined effect is quite informative about local ancestry.
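The genomic-control calculation described above is very short in practice. The sketch below is a minimal illustration of the idea, not a reproduction of the published method: it estimates the inflation factor as the ratio of the median observed χ² statistic to the median of a χ² variate with one degree of freedom, then deflates the statistics by that constant. Function names are my own.

```python
import numpy as np
from scipy import stats

def genomic_control_lambda(chisq_stats):
    """Estimate the inflation factor from 1-df chi-squared association
    statistics computed at many SNPs across the genome."""
    # With no confounding, the median of a chi-squared(1) variate is about 0.455.
    return np.median(chisq_stats) / stats.chi2.ppf(0.5, df=1)

def genomic_control_adjust(chisq_stats, lam=None):
    """Deflate every test statistic by the (approximately constant) inflation factor."""
    lam = genomic_control_lambda(chisq_stats) if lam is None else lam
    return np.asarray(chisq_stats) / max(lam, 1.0)   # never inflate the statistics
```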


More recently I've been involved in DNA sequence studies to find genes that increase the risk of autism and other disorders and diseases. This type of research involves large consortia; indeed, some papers have hundreds of authors. In big science projects it may seem that there is no place for young researchers to play a role, but this is not the case. Almost all of the established researchers in our group are eager to support young researchers. Moreover, they understand that young researchers need to have primary authorship on these ventures.

10.4 Conclusions

One theme in much of my work has been a willingness and desire to work on things that are new and somewhat controversial. Not surprisingly, these topics often garner more attention than others, and certainly more attention than the statistical contributions would merit on their own. I personally find that such topics excite my curiosity and push me to work more intensely. While working on such topics has the benefit of leading to publication in top journals, it also has drawbacks. When you've been on the opposite side of an argument, tempers can flare, and in the process grants and papers can be rejected. The best policy in this regard is to maintain good communication and try to respect the people on the other side of the argument, even while disagreeing with their opinions. Regarding another controversial topic we published on (the heritability of IQ), we encountered some stiff criticism at a leading journal. Our finding was that the so-called maternal effect had a big impact on the IQ of a child. As a consequence it could be argued that it is important for a mother to experience a good environment during pregnancy, which hardly seems controversial. Nevertheless, a top researcher in psychometrics said we would damage an entire field of scientific inquiry if we published the paper. Fortunately, Professor Eric Lander argued for our paper and we managed to publish it in Nature. Just last year David Brooks, a New York Times columnist, mentioned the paper in one of his columns. For me, knowing that an idea could stay alive for nearly two decades and rise to the attention of someone like Brooks was enormously satisfying.

While I have been blessed with singular good luck in my life, writing only about the successes of a career can leave the wrong impression. I wouldn't want young readers to think mine was a life of happy adventures, filled with praise. Just before my first job interview, a beloved faculty member told me I was only invited because I was a girl. Nonplussed, I asked him if he could help me choose my interview shoes. As a young faculty member, the undergraduate students insisted on calling me Mrs. Roeder until I finally listed my proper name on the syllabus as Professor Roeder, offering helpfully: if you have trouble pronouncing my last name, you can just call me Professor. More recently I had three papers rejected in a month, and one was rejected within four hours of submission; I believe it took me longer to upload it to the journal web site than it took them to review it. But truth be told, I have not found rejections hard to bear. Oddly, it is success that can fill me with dread. The year I won the COPSS award I was barely able to construct an adequate acceptance speech. Any field that felt I was a winner was surely a sorry field.


Since that year (1997) I have had many stretches of time in which I received no praise whatsoever. This gave me ample opportunity to regret my lack of joy at such a wonderful gift. Yet, with or without external validation, I continue to feel the greatest happiness when I see that an idea of mine has worked out. At these times I still think of Louis Pasteur, his wild passion to try the rabies vaccine (illegally), and the thrill he must have felt when it worked.

So many years later I cannot help but marvel at the randomness of life and the paths we take in our careers. That biology professor, probably up late writing a grant proposal, may have shown the Pasteur movie because he didn't have time to prepare a lecture. And in some small way it launched my ship. Just thinking about that movie inspired me to look it up on Wikipedia. Imagine my surprise when I discovered that it was a successful Hollywood venture; indeed, it was nominated for best picture. Had I known it was an award-winning movie, I never would have put so much stock in the personal message I felt it was conveying to me. And if I hadn't, would I have had the satisfying adventures of my career? Life is a mystery.


11
Lessons from a twisted career path

Jeffrey S. Rosenthal
Department of Statistical Sciences
University of Toronto, Toronto, ON

I reflect upon my academic career path, which ultimately led to receiving the COPSS Presidents' Award, in the hope of providing lessons and insights for younger researchers.

11.1 Introduction

On a chilly Toronto evening in February 2007, my wife and I returned home from a restaurant. My wife went into the kitchen to put some leftovers in the fridge, while I flopped onto the couch and absent-mindedly picked up a laptop computer to check my email. A minute later my wife heard a dazed and confused "Oh my god!" and rushed back in to see what was wrong. I was barely able to mutter that, to my amazement, I had just been selected to receive that year's COPSS Presidents' Award.

The email message talked mostly about boring details, like the importance of keeping my award "STRICTLY CONFIDENTIAL" until the official announcement (over five months later!). And the award's web page focused more on its sponsorship and eligibility requirements than on its actual meaning and value. But none of that mattered to me: I knew full well that this award was a biggie, generally regarded as the world's top academic prize in statistics. I couldn't believe that they had chosen me to receive it.

Six years later, I still can't.

I was struck then, as I often am, by my career's twists and turns: how some of the most interesting developments were also the least expected, and how unlikely it would have seemed that I would ever win something like the COPSS. In fact, I never set out to be a statistician at all.

Many young statisticians picture COPSS winners as having clear, linear career paths, in which their statistical success always appeared certain. In my case, nothing could be further from the truth.


So, in this chapter, I will reflect upon some of the twists and turns of my academic career to date, in the hope of providing lessons and insights (written in italics) for younger researchers.

11.2 Student days

I was an undergraduate student at the University of Toronto from 1984 to 1988. What I remember most from those years is the huge excitement that I felt at being surrounded by so much knowledge and learning. I would run enthusiastically to lectures and meetings, unable to wait for what I would learn next. In addition to my regular classes, I took or audited courses in other subjects of interest (astronomy, chemistry, philosophy, linguistics), joined various clubs and activities, socialized a great deal, played music with friends, developed my spoken French, went on fantastic camping and canoeing trips, and discussed everything with everyone. Around that time, a high school acquaintance (in fact, the young lady I had taken to my high school prom) remarked that she saw me on campus from time to time but never managed to talk to me, since I was always rushing off to somewhere else.

Subsequent years of pressure and deadlines have somewhat dulled that initial sense of excitement, but I can still feel and remember it well, and it has carried me through many difficult times. Indeed, if I could give just one piece of advice to students and young academics, it would be this: maintain your enthusiasm about learning as much as you can about everything. With enough excitement and passion, everything else will follow.

In my undergraduate studies, I concentrated primarily on pure mathematics and physics, with some computer science on the side. You will notice that "statistics" has not been mentioned here. Indeed, I am a COPSS winner who never took a single statistics course. I did, however, benefit tremendously from the rigorous mathematical training that I received instead.

11.2.1 Applying to graduate school

When my undergraduate studies were coming to an end, I was excited to apply to graduate programs. All around me, students were rambling on about being unsure what they wanted to study or what they would do next. I scoffed at them, since I already "knew" what I wanted to study: mathematical analysis with applications to physics! (Statistics never even crossed my mind.)

Despite my successful undergraduate years, I fretted enormously over my grad school applications, applying to loads of programs, wondering what my professors would write about me, thinking I wouldn't get accepted, and so on. That's right: even future COPSS winners worry about succeeding in academics.


My math professors advised me that, while there were many good mathematics graduate programs, the best one was at Princeton University. So I was amazed and delighted to receive a letter accepting me into their PhD program! They even offered a bit of money to help me visit their campus before deciding. So, although I "knew" that I was planning to accept their offer, I found myself on a flight to Newark to visit the famous Princeton campus.

And then a funny thing happened. My visit made me very depressed. It did reinforce the amazing research depth of the Princeton math faculty. But none of the PhD students there seemed happy. They felt a lot of pressure to write very deep doctoral theses, and to finish in four years. They admitted that there wasn't much to "do" at Princeton, and that everyone spent all their time on work, with little time for fun. (I asked one of them if there were clubs where one could go to hear music, but they didn't seem to even understand my question.)

I returned to Toronto feeling worried about my choice, and fearing that I might be miserable at Princeton. At the same time, I wondered: did it really make sense to consider such intangible factors when making important academic decisions? I finally decided that the answer was yes, and I stand by that conclusion today: it is perfectly reasonable to balance personal preferences against academic priorities.

So, I decided to consider other graduate schools too. After some more travel and much agonizing, I enrolled in the Harvard University Mathematics PhD program. Harvard also had incredible mathematical research depth, including in mathematical physics, and in addition it was in a fun-seeming city (Boston) with students who seemed to find at least a bit of time to enjoy themselves.

I had made a decision. I had even, I think, made the right decision. Unfortunately, I wasn't sure I had made the right decision. Now, it should be obvious that once you have made a decision, you should stick with it and move on; don't waste time and effort worrying about whether it was correct. But I didn't follow that advice. For several years, I worried constantly, and absurdly, about whether I should have gone to Princeton instead.

11.2.2 Graduate school beginnings

And so it was that I began my PhD in the Harvard Mathematics Department. I struggled with advanced mathematics courses about strange-seeming abstract algebraic and geometric concepts, while auditing a physics course about the confusing world of quantum field theory. It was difficult, and stressful, but exciting too.

My first big challenge was the PhD program's comprehensive examination. It was written over three different afternoons and consisted of difficult questions about advanced mathematical concepts. New PhD students were encouraged to take it "on a trial basis" just months after beginning their program. I did my best, and after three grueling days I thought I was probably "close" to the passing line. The next week I nervously went to the graduate secretaries' office to learn my result.


When the secretary told me that I had passed (unconditionally), I was so thrilled and amazed that I jumped up and down, patted various office staff on their shoulders, raced down to the departmental library, and danced in circles around the tables there. I couldn't believe it.

Passing the comps had the added bonus that I was henceforth excused from all course grades. Three months after arriving at Harvard, "all" I had left to do was write my PhD thesis. Easy, right?

No, not right at all. I was trying to learn enough about state-of-the-art mathematical physics research to make original contributions. But the research papers on my little desk were so difficult and abstract, using technical results from differential geometry and algebraic topology and more to prove impenetrable theorems about 26-dimensional quantum field theories. I remember looking sadly at one such paper and estimating that I would have to study for about two more years to understand its first sentence.

I got worried and depressed. I had thought that applications of mathematics to physics would be concrete and intuitive and fun, not impossibly difficult and abstract and intangible. It seemed that I would have to work so hard for so many years to even have a chance of earning a PhD. Meanwhile, I missed my friends from Toronto, and all the fun times we had had. I didn't see the point of continuing my studies, and considered moving back to Toronto and switching to something more "practical" like computer programming. That's right: a COPSS winner nearly dropped out of school.

11.2.3 Probability to the rescue

While beating my head against the wall of mathematical physics, I had been casually auditing a course in probability theory given by Persi Diaconis. In contrast to all the technical mathematics courses and papers I was struggling with, probability with Persi seemed fun and accessible. He presented numerous open research problems which could be understood (though not solved) in just a few minutes. There were connections and applications to other subjects, and perhaps even to the "real world." I had little to lose, so I nervously asked Persi if I could switch into probability theory. He agreed, and there I was.

I started a research project about random rotations in high dimensions: more precisely, random walks on the compact Lie group SO(n). Although today that sounds pretty abstract to me, at the time it seemed relatively concrete. Using group representation theory, I got an initial result about the mixing time of such walks. I was excited, and told Persi, and he was excited too. I hoped to improve the result further, but for a few weeks I mostly just basked in the glory of success after so much frustration.

And then a horrible thing happened. I realized that my result was wrong! In the course of doing extensive calculations on numerous scraps of paper, I had dropped an "unimportant" constant multiplier.


One morning it suddenly occurred to me that this constant couldn't be neglected after all; on the contrary, it nullified my conclusion. In short: a COPSS winner's first research result was completely bogus.

I felt sick and ashamed as I informed Persi of the situation, though fortunately he was very kind and understanding. It did teach me a lesson, one that I don't always follow but always should: when you think you have a result, write it down very carefully to make sure it is correct.

After that setback, I worked very hard for months. I wrote out long formulas for group representation values. I simplified them using subtle calculus tricks, and bounded them using coarse dominating integrals. I restricted to a particular case (where each rotation was 180 degrees through some hyperplane) to facilitate computations. Finally, hundreds of pages of scrap paper later, I had actually proved a theorem. I wrote it up carefully, and at last my first research paper was complete. (All of the author's research papers mentioned here are available at www.probability.ca.) I knew I had a long road ahead (how long I could not estimate), but I now felt that I was on my way. I enthusiastically attended lots of research seminars, and felt like I was becoming part of the research community.

Over the next couple of years, I worked on other related research projects, and slowly got a few other results. One problem was that I couldn't really judge how far along I was towards my PhD. Did I just need a few more results to finish, or was I still years away? I was mostly too shy or nervous to ask my supervisor, and he didn't offer any hints. I finally asked him whether I should perhaps submit my random rotations paper for publication in a research journal (a new experience for me), but he demurred, saying it was "too much of a special case of a special case," which naturally discouraged me further. (As it happens, after I graduated I submitted that very same random rotations paper to the prestigious Annals of Probability, and it was accepted essentially without change, leading me to conclude: PhD students should be encouraged to submit papers for publication. But I didn't know that then.)

I started to despair for the future again. I felt that if only I could finish my PhD and get tenure at a decent university, then life would be good. But I wondered if that moment would ever come. Indeed, I was a future COPSS winner who thought he would never graduate.

A few weeks later, I was lifted out of my funk by a rather awkward occurrence. One Friday in November 1990, as I was leaving a research meeting with Persi, he casually mentioned that perhaps I should apply for academic jobs for the following year. I was speechless. Did this mean he thought I was already nearly finished with my PhD, even while I was despairing of ever graduating? I left in a daze, and then spent the weekend puzzled and enthusiastic and worried about what this all meant. When Monday finally came, I sought out Persi to discuss details. In a quick hallway conversation, I told him that if he really did think that I should apply for academic jobs, then I should get on it right away, since some of the deadlines were already approaching.


Right before my eyes, he considered for several seconds, and then changed his mind! He said that it might be better for me to wait another year instead.

This was quite a roller coaster for me, and I've since tried to remember to be as clear as possible with PhD students about expectations and prognoses. Nevertheless, I was delighted to know that at least I would (probably) graduate the following year, i.e., after a total of four years of PhD study. I was thrilled to see light at the end of the tunnel.

Finally, the next year, I did graduate, and did apply for academic jobs. The year 1992 was a bad time for mathematical employment, and I felt pessimistic about my chances. I didn't even think to include my full contact information in my applications, since I doubted anyone would bother to contact me. Indeed, when the photocopier added huge ink splotches to some of my application materials, I almost didn't bother to recopy them, since I figured no one would read them anyway. Yes, a future COPSS winner barely even considered the possibility that anyone would want to offer him a job.

11.3 Becoming a researcher

To my surprise, I did get job offers after all. In fact, negotiating the job interviews, offers, terms, and acceptances turned out to be quite stressful in and of itself; I wasn't used to discussing my future with department chairs and deans!

Eventually I arranged to spend 1.5 years in the Mathematics Department at the University of Minnesota. They had a large and friendly probability group there, and I enjoyed talking with and learning from all of them. It is good to be part of a research team.

I also arranged that I would then move from Minnesota to the Statistics Department at my alma mater, the University of Toronto. I was pleased to return to the city of my youth, with all its fond memories, and to the research-focused (though administration-heavy) university. On the other hand, I was joining a Statistics Department even though I had never taken a statistics course.

Fortunately, my new department did not try to "mold" me into a statistician; they let me continue to work as a mathematical probabilist. I applaud them for this, and have come to believe that it is always best to let researchers pursue interests of their own choosing.

Despite the lack of pressure, I did hear more about statistics (for the first time) from my new colleagues. In addition, I noticed something interesting in the way my research papers were being received. My papers that focused on technical mathematical topics, like random walks on Lie groups, were being read by a select few.


But my papers that discussed the theory of the newly popular Markov chain Monte Carlo (MCMC) computer algorithms, which Persi with his usual foresight had introduced me to, were being cited by lots of statistical researchers. This caused me to focus more on MCMC issues, and ultimately on other statistical questions too. Of course, research should never be a popularity contest. Nevertheless, it is wise to work more on research questions which are of greater interest to others.

In my case, these reactions led me to focus primarily on the theory of MCMC, which served me very well in building my initial research career. I still considered myself a probabilist (indeed, I recall someone referring to me as "a statistician" around that time, and feeling uncomfortable with the designation), but my research more and more concerned applications to statistical algorithms. I was publishing papers and working very hard; my initial small office was right off the main hallway, and colleagues commented about seeing my light still on at 10:30 or 11:00 many evenings. Indeed, research success always requires lots of hard work and dedication.

11.3.1 Footprints in the sand

My interactions with the research community developed a slightly odd flavor. MCMC users were aware of my work and would sometimes cite it in general terms ("for related theoretical issues, see Rosenthal"), but hardly anyone would read the actual details of my theorems. Meanwhile, my department was supportive from a distance, but not closely following my research. My statistics colleagues were working on questions that I didn't have the background to consider. And probability colleagues wouldn't understand the statistical/MCMC motivation and thus wouldn't see the point of my research direction. So, despite my modest research success, I was becoming somewhat academically isolated.

That was to change when I met Gareth Roberts. He was a young English researcher who also had a probability background, and was also studying theoretical properties of MCMC. The summer of 1994 featured three consecutive conferences that we would both be attending, so I looked forward to meeting him and exploring common interests. Our first encounter didn't go well: I finally cornered him at the conference's opening reception, only to hear him retort, "Look, I've just arrived from England and I'm tired and jet-lagged; I'll talk to you tomorrow." Fortunately, the next day he was in better form, and we quickly discovered common interests not only in research, but also in music, sports, chess, bridge, jokes, and more. Most importantly, he had a similar (though more sophisticated) perspective about applying probability theory to better understand the nature and performance of MCMC. We developed a fast friendship which has now lasted through 19 years, 33 visits, and 38 joint research papers (and counting). Gareth has improved my research career and focus immeasurably; social relationships often facilitate research collaborations.

My career still wasn't all smooth sailing. Research projects always seemed to take longer than they should, and to lead to weaker results than I'd hoped. Research, by its very nature, is a slow and frustrating process.


Around that time, one of my PhD students had a paper rejected by a journal, and shyly asked me if that had ever happened to me. I had to laugh; of course it had! Yes, even COPSS winners get their papers rejected. Often. Nevertheless, I was getting papers published and doing okay as a researcher: not making any huge impact, but holding my own. I was honored to receive tenure in 1997, thus fulfilling my youthful dream, though that did lead to a depressing few months of drifting and wondering "what should I do next?" A very unexpected answer to that question was to come several years later.

11.3.2 The general public

Like many mathematical researchers, I sometimes felt frustrated that I couldn't easily explain my work to non-academics (joking that I was the guy no one wanted to talk to at a party), but I had never pursued this further. In 2003, some writers and journalists in my wife's family decided that I should write a probability book for the general public. Before I knew it, they had put me in touch with a literary agent, who got me to write a few sample chapters, which quickly scored us an actual publishing contract with HarperCollins Canada. To my great surprise, and with no training or preparation, I had agreed to write a book for a general audience about probabilities in everyday life, figuring that it is good to occasionally try something new and different.

The book took two years to write. I had to constantly remind myself that writing for a general audience was entirely different from writing a research paper or even a textbook. I struggled to find amusing anecdotes and catchy examples without getting bogged down in technicalities. Somehow I pulled it off: "Struck by Lightning: The Curious World of Probabilities" was published in sixteen editions and ten languages, and was a bestseller in Canada. This in turn led to numerous radio, TV, and newspaper interviews, public lectures, appearances in several documentaries, and invitations to present to all sorts of different groups and organizations. Completely unexpectedly, I became a little bit of a "public persona" in Canada. This in turn led to several well-paid consulting jobs (including one involving computer parsing of PDF files of customers' cell phone bills to compare prices), assisting with a high-profile media investigation of a lottery ticket-swapping scandal and publishing about it in the RCMP Gazette, serving as an expert witness in a brief to the Supreme Court of Canada, and more. You just can't predict what twists your career will take.

11.3.3 Branching out: Collaborations

In a different direction, I have gradually started to do more interdisciplinary work. As I have become slightly better known through my research, book, and interviews, academics from a variety of departments have started asking me to collaborate on their projects.


I have found that it is impossible to "prepare" for such collaborations; rather, you have to listen carefully and be open to whatever research input your partners require. Nevertheless, thanks to some combination of my mathematical, computer, and social skills, I have managed to be more helpful than I would have predicted, leading to quite a number of different joint papers. (I guess I am finally a statistician!)

For example, I provided mathematical analysis about generators of credit-rating transition processes for a finance colleague. I worked on several papers with computer science and economics colleagues (one of which led to a tricky probability problem, which in turn led to a nice probability paper with Robin Pemantle). I was also introduced to some psychologists analyzing youth criminal offender data, which began a long-term collaboration that continues to this day. Meanwhile, an economics colleague asked me to help investigate temperature and population changes in pre-industrial Iceland. And a casual chat with some philosophy professors led to a paper about the probability-related philosophical dilemma called the Sleeping Beauty problem.

Meanwhile, I gradually developed a deeper friendship with my department colleague Radu Craiu. Once again, social interaction led to discovering common research interests, in this case concerning MCMC methodology. Radu and I ended up co-supervising a PhD student and publishing a joint paper in the top-level Journal of the American Statistical Association (JASA), with two more papers in preparation. Having a longer-term collaborator within my own department has been a wonderful development, and has once again reminded me that it is good to be part of a research team.

More recently, I met a speech pathologist at a lecture and gave her my card. She finally emailed me two years later, asking me to help her analyze subjects' tongue positions when producing certain sounds. Here my undergraduate linguistics course, taken with no particular goal in mind, was suddenly helpful; knowledge can provide unexpected benefits. Our resulting collaboration led to a paper in the Journal of the Acoustical Society of America, a prestigious journal which is also my second one with the famous initials "JASA."

I was also approached by a law professor (the son-in-law of a recently retired statistics colleague). He wanted to analyze the text of supreme court judgments, with an eye towards determining their authorship: did the judge write a judgment directly, or did the law clerks do it? After a few false starts, we made good progress. I submitted our first methodological paper to JASA, but they rejected it quickly and coldly, saying it might be more appropriate for an educational journal like Chance. That annoyed me at the time, but made its later acceptance in The Annals of Applied Statistics all the more sweet. A follow-up paper was published in the Cornell Law Review and later referred to in the New York Times, and more related publications are on the way.

These collaborations were all very different, in both content and process. But each one involved a personal connection with one or more other researchers, which after many discussions eventually led to worthwhile papers published in high-level research journals.


I have slowly learned to always be on the lookout for such connections: unexpected encounters and social interactions can sometimes lead to major new research collaborations.

11.3.4 Hobbies to the fore

Another surprise for me has been the extent to which my non-research interests and hobbies have fed my academic activities in unexpected ways.

As a child I did a lot of computer programming of games and other silly things. When email and bulletin boards first came out, I used them too, even though they were considered unimportant compared to "real" computer applications like numerical computations. I thought this was just an idle pastime. Years later, computer usage has become very central to my research and consulting and collaborations: from Monte Carlo simulations to text processing to internet communications, I couldn't function without them. And I've been helped tremendously by the skills acquired through my "silly" childhood hobby.

I'd always played a lot of music with friends, just for fun. Later on, music not only cemented my friendship with Gareth Roberts, it also allowed me to perform at the infamous Bayesian conference "cabarets" and thus get introduced to more top researchers. In recent years, I even published an article about the mathematical relationships of musical notes, which in turn gave me new material for my teaching. Not bad for a little "fun" music jamming.

In my late twenties I studied improvisational comedy, eventually performing in small local comedy shows. Unexpectedly, improv's attitude of "embracing the unexpected" helped me to be a more confident and entertaining teacher and presenter, turning difficult moments into humorous ones. This in turn made me better at media interviews when promoting my book. Coming full circle, I was later asked to perform musical accompaniment to comedy shows, which I continue to do to this day.

I'd always had a strong interest in Canadian electoral politics. I never dreamed that this would affect my research career, until I suddenly found myself using my computer skills to analyze polling data and projections from the 2011 Canadian federal election, leading to a publication in The Canadian Journal of Statistics.

Early in my teaching career, I experimented with alternative teaching arrangements, such as having students work together in small groups during class time. (Such practices are now more common, but back in the early 1990s I was slightly ahead of my time.) To my surprise, that eventually led to a publication in the journal Studies in Higher Education.

In all of these cases, topics I had pursued on their own merits, without connection to my academic career, turned out to be useful in my career after all. So, don't hesitate to pursue diverse interests; they might turn out to be useful in surprising ways.


11.4 Final thoughts

Despite my thinking that I "had it all planned out," my career has surprised me many times over. I never expected to work in statistics. I had no idea that MCMC would become such a central part of my research. I never planned to write for the general public, or to appear in the media. And I certainly never dreamed that my music or improv or political interests would influence my research profile in any way.

Nevertheless, I have been very fortunate: to have strong family and educational foundations, to attend top-level universities and be taught by top-level professors, and to have excellent opportunities for employment and publishing and more. I am very grateful for all of this. And it seems that my ultimate success has come because of all the twists and turns along the way, not in spite of them. Perhaps that is the real lesson here: that, as in improv, we should not fear unexpected developments, but rather embrace them.

My career, like most, has experienced numerous research frustrations, rejected papers, and dead ends. And my university's bureaucratic rules and procedures sometimes make me want to scream. But looking back, I recall my youthful feeling that if only I could get tenure at a decent university, then life would be good. I was right: it has been.


12
Promoting equity

Mary W. Gray
Department of Mathematics and Statistics
American University, Washington, DC

12.1 Introduction

"I'm not a gentleman." This phrase should alert everyone to the fact that this is not a conventional statistics paper; rather, it is about the issues of equity that the Elizabeth Scott Award addresses. But that was the phrase that marked my entry onto the path for which I received the award. It had become obvious to me early in my career as a mathematician that women were not playing on a level field in the profession. The overt and subtle discrimination was pervasive. Of course, it was possible to ignore it and to get on with proving theorems. But I had met Betty Scott and other women who were determined not to remain passive. Where better to confront the issues than the Council of the American Mathematical Society (AMS), the group that considered itself the guardian of the discipline. I arrived at one of its meetings, eager to observe the operation of the august group. "This meeting is open only to Council members," I was told. Secure in having read the AMS by-laws, I quoted the requirement that Council meetings be open to all AMS members. "Oh," said the president of the society, "it's a gentlemen's agreement that they be closed." I uttered my open-sesame and remained.

However, the way the "old boys' network" operated inspired many mathematicians to raise questions at Council meetings and more generally. Results were often discouraging. To the notion of introducing blind refereeing of AMS publications, the response from our distinguished colleagues was, "But how would we know the paper was any good if we didn't know who wrote it?" Outside Council premises at a national mathematics meeting, a well-known algebraist mused, "We once hired a woman, but her research wasn't very good." Faced with such attitudes about women in mathematics, a group of us founded the Association for Women in Mathematics (AWM).


In the 40+ years of its existence, AWM has grown to thousands of members from around the world (about 15% of whom are men) and has established many travel grant and fellowship programs to encourage women and girls to study mathematics and to support those in the profession. Prizes are awarded by AWM for the research of established mathematicians and of undergraduates, and for teaching. Through grant support from NSF, NSA, other foundations, and individuals, the organization supports "Sonya Kovalevskaya Days" for high school and middle school students at institutions throughout the country: events designed to involve the participants in hands-on activities, to present a broad view of the world of mathematics, and to encourage a sense of comradeship with others who might be interested in mathematics.

And things got better, at least in part due to vigorous advocacy work. The percentage of PhDs in mathematics going to women went from 6% when I got my degree to 30%, before settling in the mid-twenties (the figure is more than 40% in statistics). Even "at the top" there was progress. The most prestigious university departments discovered that there were women mathematicians worthy of tenure-track positions. A woman mathematician was elected to the National Academy of Sciences, followed by several more. In its 100+ year history, the AMS finally elected two women presidents (the first in 1983, the second in 1995), compared with five of the seven most recent presidents of the American Statistical Association (ASA). However, a combination of the still chilly climate for women in math and the gap between very abstract mathematics and the work for political and social rights that I considered important led me to switch to applied statistics.

12.2 The Elizabeth Scott Award

To learn that I had been selected for the Elizabeth Scott Award was a particular honor for me. It was Betty Scott who was responsible in part for my decision to make the switch to statistics. Because I came to know of her efforts to establish salary equity at Berkeley, when a committee of the American Association of University Professors decided to try to help other institutions accomplish a similar task, I thought of asking Betty to develop a kit (Scott, 1977) to assist them.

Using regression to study faculty salaries now seems an obvious technique; there are probably not many colleges or universities that have not tried to do so in the forty years since Title VII's prohibition of discrimination in employment based on sex became applicable to college professors. Few will remember that when Title VII was first enacted, professors were exempted, on the grounds that the qualifications and judgments involved were so subjective and specialized as to be beyond the requirement of equity, and that in any case discrimination would be too difficult to prove. It still is difficult to prove, and unfortunately it still exists.


That salary inequities have generally diminished, through litigation or voluntary action, is due partly to the cascade of studies based on the kit and its refinements. As we know, discrimination cannot be proved by statistics alone, but extreme gender-disproportionate hiring, promotion, and pay are extremely unlikely to occur by chance.

Professor Scott was not happy with the "remedies" that institutions were using to "fix" the problem. Once women's salaries were fitted to a male model, administrators appeared to love looking at a regression line and circling the observations below it. "Oh look, if we add $2000 to the salaries of a few women and $1000 to the salaries of a few more, we will fix the problem," they exclaimed. Such an implementation relied on a common misunderstanding of variation. Sure, some women will be below the line and, of course, some women (although probably not many) may be above the male line. But what the regression models generally show is that, on average, women are paid less than similarly qualified men. What is even worse, all too frequently administrators then engage in a process of showing that those "circled" women really "deserve" to be underpaid on some subjective basis, so that the discrimination continues.

Not only does the remedy described confuse the individual observation with the average, but it may reward exactly the wrong women. No matter how many variables we throw in, we are unlikely to account entirely for legitimate, objective variation. Some women are "better" than other women, or other men, with ostensibly similar qualifications, no matter what metric the institution uses: research, teaching, or service (Gray, 1988). But systematically their salaries, and those of all other women, trail those of similarly qualified men.

What is appropriate for a statistically identified problem is a statistically based remedy. Thus Betty and I wrote an article (Gray and Scott, 1980) explaining that if the average difference between men's and women's salaries as shown by a regression model is $2000, then the salary of each woman should be increased by that amount. Sorry to say, this is not an idea that has been widely accepted. Women faculty are still paid less on the whole, there are still occasional regression-based studies, there are spot remedies, and often the very best women faculty continue to be underpaid.
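To make the remedy just described concrete, here is a minimal sketch of the kind of calculation involved. The column names, predictors, and use of ordinary least squares are my own illustrative assumptions, not the contents of the Scott (1977) kit; real salary-equity studies require far more careful modeling.

```python
import pandas as pd
import statsmodels.api as sm

def average_shortfall(faculty: pd.DataFrame) -> float:
    """Fit salary on legitimate predictors plus a 0/1 'female' indicator;
    the negated coefficient on 'female' estimates the average shortfall."""
    predictors = ["years_since_degree", "assoc_prof", "full_prof", "female"]  # hypothetical columns
    X = sm.add_constant(faculty[predictors])
    fit = sm.OLS(faculty["salary"], X).fit()
    return -fit.params["female"]

def flat_remedy(faculty: pd.DataFrame) -> pd.DataFrame:
    """Raise every woman's salary by the estimated average difference,
    rather than adjusting only those who fall below the fitted line."""
    gap = average_shortfall(faculty)
    adjusted = faculty.copy()
    adjusted.loc[adjusted["female"] == 1, "salary"] += max(gap, 0.0)
    return adjusted
```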


The great interest in statistical evaluation of salary inequities, particularly in the complex setting of academia, led to Gray (1993), which expanded on the methods we used. Of course, statistical analysis may fail to show evidence of inequity, but even when it does, a remedy may not be forthcoming, because those responsible may be disinclined to institute the necessary changes. If employers refuse to remedy inequities, those aggrieved may resort to litigation, which is a long, expensive process, often unsuccessful and almost always painful.

Experience as an expert witness teaches that however convincing the statistical evidence might be, absent anecdotal evidence of discrimination and a sympathetic plaintiff, success at trial is unlikely. Moreover, one should not forget the frequent inability of courts to grasp the significance of statistical evidence. Juries are often more willing to make the effort to understand and evaluate such evidence than are judges (or attorneys on both sides). My favorite example of judicial lack of comprehension of statistics was a US District Court judge's inability to understand that if women's initial salaries were less than men's and yearly increases were fixed across-the-board percentages, the gap would grow progressively larger each year.

A 2007 Supreme Court decision (Ledbetter, 2007) made achieving pay equity for long-term victims of discrimination virtually impossible. The Court declared that inequities in salary that had existed, and in many cases increased, for many years did not constitute a continuing violation of Title VII. Litigation would have had to be instituted within a short time after the very first discriminatory paycheck in order for a remedy to be possible. Fortunately this gap in coverage was closed by the passage of the Lilly Ledbetter Fair Pay Act (Ledbetter, 2009), named to honor the victim in the Supreme Court case; the path to equity may prove easier as a result.

12.3 Insurance

Salary inequities are not the only obstacle professional women face. There continues to exist discrimination in fringe benefits, directly financial as well as indirect, through lab space, assignment of assistants, and exclusionary actions of various sorts. Early in my career I received a notice from the Teachers Insurance and Annuity Association (TIAA), the retirement plan used at most private and many public universities, including American University, listing what I could expect in retirement benefits from my contributions and those of the university, in the form of x dollars per $100,000 in my account at age 65. There were two columns, one headed "women" and a second, with amounts 15% higher, headed "men." When I contacted the company to point out that Title VII prohibited discrimination in fringe benefits as well as in salary, I was informed that the figures represented discrimination on the basis of "longevity," not on the basis of sex.

When I asked whether the insurer could guarantee that I would live longer than my male colleagues, I was told that I just didn't understand statistics. Learning that the US Department of Labor was suing another university that had the same pension plan, I offered to help the attorney in charge, the late Ruth Weyand, an icon in women's rights litigation. At first we concentrated on gathering data to demonstrate that the difference in longevity between men and women was in large part due to voluntary lifestyle choices, most notably smoking and drinking. In a settlement conference with the TIAA attorneys, one remarked, "Well, maybe you understand statistics, but you don't understand the law."


This provided the inspiration for me to sign up for law courses, thinking I would learn a little. However, it turned out that I really loved the study of law, and it was easy (relatively). While I was getting a degree and qualifying as an attorney, litigation in the TIAA case and a parallel case involving a state employee pension plan in Arizona (Arizona Governing Committee, 1983) continued. The latter reached the US Supreme Court first, by which time I was also admitted to the Supreme Court Bar and could not only help the appellee woman employee's lawyer but could also write an amicus curiae brief on my own.

Working with a distinguished feminist economist, Barbara Bergmann, we counteracted the legal argument of the insurer, namely that the law required that similarly situated men and women be treated the same, but that men and women were not so situated because of differences in longevity. To show that they are similarly situated, consider a cohort of 1000 men and 1000 women at age 65. The death ages of 86% of them can be matched up. That is, a man dies at 66 and a woman dies at 66, a man dies at 90 and a woman dies at 90, and so on. The remaining 14% consists of 7% who are men who die early, unmatched by the deaths of women, and 7% who are women who live longer, unmatched by long-lived men. But 86% of the cohort match up, i.e., men and women are similarly situated and must be treated the same (Bergmann and Gray, 1975).
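The matching argument above can be phrased as a small computation: for cohorts of equal size, the fraction that can be paired off is one minus the total-variation distance between the two death-age distributions. The sketch below assumes hypothetical arrays of death ages; the 86% figure quoted in the text comes from the mortality data used by Bergmann and Gray (1975), not from this toy code.

```python
import numpy as np

def matched_fraction(men_death_ages, women_death_ages, ages=np.arange(65, 111)):
    """Share of an equal-size cohort whose death ages can be paired one man
    with one woman dying at the same age: one minus the total-variation
    distance between the two empirical death-age distributions."""
    m = np.array([(np.asarray(men_death_ages) == a).sum() for a in ages], dtype=float)
    w = np.array([(np.asarray(women_death_ages) == a).sum() for a in ages], dtype=float)
    m /= m.sum()
    w /= w.sum()
    return float(np.minimum(m, w).sum())   # e.g., about 0.86 for the cited mortality tables
```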


Although the decision in favor of equity mandated only prospective equal payments, most plans, including TIAA, equalized retirement benefits resulting from past contributions as well. Women who had been doubly disadvantaged, by discriminatorily low salaries and gender-based, unjustly low retirement income, told of now being able to have meat and fresh fruit and vegetables a few times a week, as well as the security of a phone (this being before the so-called "Obama phone").

Whereas retirement benefits and employer-provided life insurance are covered by Title VII, the private insurance market is not. Insurance is state-regulated, and only a few states required sex equity; more, but not all, required race equity. A campaign was undertaken to lobby state legislatures, state insurance commissions, and governors to establish legal requirements for sex equity in all insurance. Of course, sex equity cuts both ways. Automobile insurance and life insurance in general are more expensive for males than for females. In fact, an argument often used against discriminatory rates is that in some states a young male driver with a clean record was being charged more than an older driver (of either sex) with two DUI convictions.

Today women are still underrepresented in the study of mathematics, but thirty years ago the disparity was even more pronounced. As a result, few activists in the women's movement were willing and able to dig into the rating policies of insurance companies to make the case for equity in insurance on the basis of statistics. One time I testified in Helena, Montana, and then was driven in a snowstorm to Great Falls to get the last flight out to Spokane and then on to Portland to testify at a legislative session in Salem, leaving the opposition lobbyists behind. The insurance industry, having many resources in hand, had sent in a new team from Washington that was very surprised to see in Salem the next day that they were confronted once again by arguments for equity. But with little or no money for political contributions, the judicial victory went unmatched by legislative ones. It has been left to the Affordable Care Act (2009) to guarantee non-discrimination in important areas like health insurance. In the past, women were charged more on the ostensibly reasonable grounds that childbirth is expensive and that even healthy women seek health care more often than do men. However, eventually childbirth expenses were no longer a major factor and men began to accrue massive health care bills, in particular due to heart attacks (in part, of course, because they visit doctors less frequently), but under the existing private systems, rates were rarely adjusted to reflect the shift in the cost of benefits.

12.4 Title IX

Experience with the insurance industry led to more awareness of other areas of sex discrimination and to work with the Women's Equity Action League (WEAL), a lobbying organization concentrating primarily on economic issues and working for the passage and implementation of Title IX, which makes illegal a broad range of sex discrimination in education. The success of American women in sports is the most often cited result of Title IX, but the legislation also led to about half of the country's new doctors and lawyers being women, once professional school admission policies were revised. Statistics also played an important role in Title IX advances (Gray, 1996). Cohen versus Brown University (Cohen, 1993) established that women and men must be treated equitably with regard to opportunities to participate in and expenditures for collegiate sports, relying on our statistical evidence of disparities. My favorite case, however, involved Temple University, where the sports director testified that on road trips women athletes were housed three to a room and men two to a room because "women take up less space" (Haffer, 1982). As noted, anecdotal evidence is always useful. In another Philadelphia case, a course in Italian was offered at Girls High as the "equal" of a calculus course at the males-only Central High. The US Supreme Court let stand a lower court decision that this segregation by sex was not unconstitutional, but girls were admitted to Central High when the practice was later found unconstitutional under Pennsylvania law (Vorchheimer, 1976).

12.5 Human rights


The Elizabeth Scott Award cites my work exposing discrimination and encouraging political action, which in fact extends beyond work for women in the mathematical sciences to the more general defense of human rights. Hands-on experience began when I represented the AMS in a delegation to Montevideo that also included members of the French Mathematical Society, the Mexican Mathematical Society, and the Brazilian Applied Mathematics Society. The goal was to try to secure the release of José Luis Massera, a prominent Uruguayan mathematician who had been imprisoned for many years. We visited prisons, officials, journalists, and others, ending up with the colonel who had the power to release the imprisoned mathematician. We spoke of Massera's distinguished mathematics, failing health, and the international concern for his situation. The colonel agreed that the prisoner was an eminent personage and added, "He will be released in two months." To our astonishment, he was.

Success is a great motivator; it led to work with Amnesty International (AI) and other organizations on cases not only of mathematicians but of many others unjustly imprisoned, tortured, disappeared, or murdered. Once, at the Council of the AMS, the issue arose of the people who had "disappeared" in Argentina, several of them mathematicians known to some on the Council. Then another name came up, one that no one recognized, not surprising since at the time she was a graduate student in mathematics. One of my fellow Council members suggested that as such she was not a "real" mathematician and thus not worthy of the attention of the AMS. Fortunately, that view did not prevail.

Of the cases we worked on, one of the most memorable was that of Moncef Ben Salem, a differential geometer who had visited my late husband at the University of Maryland before spending more than 20 years in prison or under house arrest in his home country of Tunisia. As a result of the revolution in 2011, he became Minister of Higher Education and Scientific Research there. Meeting in Tunis several times recently, we recalled the not-so-good old days and focused on improvements in higher education and the mathematics education of young people. Much of my work in international development and human rights has come through my association with the American Middle East Education Foundation (Amideast) and Amnesty International, where I served for a number of years as international treasurer; someone not afraid of numbers is always welcome as a volunteer in non-profit organizations. Integrating statistics into human rights work has now become standard in many situations.

An opportunity to combine human rights work with statistics arose in the aftermath of the Rwanda genocide in the 1990s. As the liberating force moved across Rwanda from Tanzania in the east, vast numbers of people were imprisoned; two years later essentially the only way out of the prisons was death. The new government of Rwanda asserted, quite correctly, that the number of prisoners overwhelmed what was left, or what was being rebuilt, of the judicial system. On the other hand, major funders, including the US, had already built some new prisons and could see no end in sight to the incarceration of a substantial portion of the country's population. A basic human rights principle is the right to a speedy trial, with certain due process protections.


The Assistant Secretary for Democracy, Human Rights and Labor in the US State Department, a former AI board member, conceived a way out of the impasse: begin by bringing to trial a random sample of the internees. Unfortunately, the Rwandan government was unhappy that the sample selected did not include its favorite candidates for trial, and the scheme was not adopted. It took a decade to come up with a system, including village courts, to resolve the problem.

A few months after the invasion of Iraq, it became imperative to identify the location and needs of schools in the country. Initial efforts at repair and rehabilitation had been unsuccessful because the schools could not be located; no street addresses were available in Iraq. Using teachers on school holiday for the survey, we managed within two weeks to find the locations (using GPS) and to gather information on the status and needs of all but a handful of the high schools in the country. A later attempt to do a census of the region of Iraq around Kirkuk proved impossible, as each interested ethnic group was determined to construct the project in such a way as to inflate the proportion of the total population that came from its group.

Other projects in aid of human rights include work with Palestinian universities in the Occupied Territories, as well as with universities elsewhere in the Middle East, North Africa, Latin America, and the South Pacific, on curriculum, faculty governance, and training. A current endeavor involves working with the American Bar Association to survey Syrian refugees in Jordan, Lebanon, and Turkey in order to document human rights abuses. Designing and implementing an appropriate survey has presented a huge challenge, as more than half of the refugees are not in camps and thus are difficult to locate, as well as often reluctant to speak of their experiences.

12.6 Underrepresented groups

Women are by no means the only underrepresented group among mathematicians. In the spirit of the Scott Award is my commitment to increasing the participation of African Americans, Latinos, American Indians and Alaskan Natives, and Native Hawaiians and other Pacific Islanders, including through National Science Foundation and Alfred P. Sloan funded programs for young students (Hayden and Gray, 1990), as well as support of PhD students, many of whom are now faculty at HBCUs. But certainly much remains to be done, here and abroad, in support of the internationally recognized right to enjoy the benefits of scientific progress (ICESCR, 2013).

Efforts in these areas are designed to support what Elizabeth Scott worked for: equity and excellence in the mathematical sciences. I am proud to have been given the award in her name for my work.


References

Affordable Health Care Act, Pub. L. 111–148 (2009).

Arizona Governing Committee vs. Norris, 464 US 808 (1983).

Article 15 of the International Covenant on Economic, Social and Cultural Rights (ICESCR).

Bergmann, B. and Gray, M.W. (1975). Equality in retirement benefits. Civil Rights Digest, 8:25–27.

Cohen vs. Brown University, 991 F. 2d 888 (1st Cir. 1993), cert. denied, 520 US 1186 (1997).

Gray, M.W. (1988). Academic freedom and nondiscrimination — enemies or allies? University of Texas Law Review, 6:1591–1615.

Gray, M.W. (1993). Can statistics tell the courts what they do not want to hear? The case of complex salary structures. Statistical Science, 8:144–179.

Gray, M.W. (1996). The concept of "substantial proportionality" in Title IX athletics cases. Duke Journal of Gender and Social Policy, 3:165–188.

Gray, M.W. and Scott, E.L. (1980). A "statistical" remedy for statistically identified discrimination. Academe, 66:174–181.

Haffer vs. Temple University, 688 F. 2d 14 (3rd Cir. 1982).

Hayden, L.B. and Gray, M.W. (1990). A successful intervention program for high ability minority students. School Science and Mathematics, 90:323–333.

Ledbetter vs. Goodyear Tire & Rubber, 550 US 618 (2007).

Lilly Ledbetter Fair Pay Act, Pub. L. 112–2 (2009).

Scott, E.L. (1977). Higher Education Salary Evaluation Kit. American Association of University Professors, Washington, DC.

Vorchheimer vs. School District of Philadelphia, 532 F. 2d 880 (3rd Cir. 1976).


Part III

Perspectives on the field and profession


13

Statistics in service to the nation

Stephen E. Fienberg
Department of Statistics
Carnegie Mellon University, Pittsburgh, PA

13.1 Introduction

Let me begin with a technical question:

Who first implemented large-scale hierarchical Bayesian models, when, and why?

I suspect the answer will surprise you. It was none other than John Tukey, in 1962, with David Wallace and David Brillinger, as part of the NBC Election Night team. What I am referring to is their statistical model for predicting election results based on early returns; see Fienberg (2007).

The methods and election night forecasting model were indeed novel, and are now recognizable as hierarchical Bayesian methods with the use of empirical Bayesian techniques at the top level. The specific version of hierarchical Bayes in the election night model remains unpublished to this day, but Tukey's students and his collaborators began to use related ideas on "borrowing strength," and all of this happened before the methodology was described in somewhat different form by I.J. Good in his 1965 book (Good, 1965) and christened as "hierarchical Bayes" in the classic 1972 paper by Dennis Lindley and Adrian Smith (Lindley and Smith, 1972); see also Good (1980).

I was privileged to be part of the team in 1976 and 1978, and there were close to 20 PhD statisticians involved in one form or another, working in Cherry Hill, New Jersey, in the RCA Lab which housed a large mainframe computer dedicated to the evening's activities (as well as a back-up computer), and a few in New York interacting with the NBC "decision desk." The analysts each had a computer terminal and an assignment of states and political races. A summary of each run of the model for a given race could be read easily from the terminal console but full output went to a nearby line printer and was almost immediately available for detailed examination. Analysts worked with the model, often trying different prior distributions (different past elections chosen as "models" for the ones for which they were creating forecasts) and checking on robustness of conclusions to varying specifications. This experience was one among many that influenced how I continue to think about my approach to Bayesian hierarchical modeling and its uses to the present day; see, e.g., Airoldi et al. (2010).

All too often academic statisticians think of their role as the production of new theory and methods. Our motivation comes in large part from the theoretical and methodological work of others. Our careers are built on, and judged by, our publications in prestigious journals. And we often build our careers around such research and publishing activities.

In this contribution, I want to focus on a different role that many statisticians have and should play, and how this role interacts with our traditional role of developing new methods and publishing in quality journals. This is the role we can fulfill in support of national activities and projects requiring statistical insights and rigor. Much of my story is autobiographical, largely because I know best what has motivated my own efforts and research. Further, I interpret the words "to the nation" in my title quite liberally, so that it includes election night forecasts and other public uses of statistical ideas and methods.

In exploring this theme, I will be paying homage to two of the most important influences on my view of statistics: Frederick Mosteller, my thesis advisor at Harvard, and William Kruskal, my senior colleague at the University of Chicago, where I went after graduation; see Fienberg et al. (2007).

One of the remarkable features of Mosteller's autobiography (Mosteller, 2010) is that it doesn't begin with his early life. Instead, the volume begins by providing chapter-length insider accounts of his work on six collaborative, interdisciplinary projects: evaluating the pre-election polls of 1948, statistical aspects of the Kinsey report on sexual behavior in the human male, mathematical learning theory, authorship of the disputed Federalist papers, safety of anesthetics, and a wide-ranging examination of the Coleman report on equality of educational opportunity. With the exception of mathematical learning theory, all of these deal with important applications, where new theory and methodology or adaptation of standard statistical thinking was important. Mosteller not only worked on such applications but also thought it important to carry the methodological ideas back from them into the mainstream statistical literature.

A key theme I emphasize here is the importance of practical motivation of statistical theory and methodology, and the iteration between application and theory. Further, I want to encourage readers of this volume, especially the students and junior faculty, to get engaged in the kinds of problems I'll describe, both because I'm sure you will find them interesting and also because they may lead to your own professional development and advancement.
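The "borrowing strength" idea behind such hierarchical models can be illustrated with a toy two-level normal model. This is only a generic sketch with simulated data; it is not the unpublished NBC election-night model, and the empirical-Bayes step shown is just one simple way of setting the top-level parameters from the data.

import numpy as np

rng = np.random.default_rng(0)

# Simulated "groups" (think of races or states), each with its own true mean.
n_groups, n_per_group = 8, 5
true_means = rng.normal(loc=50.0, scale=4.0, size=n_groups)
y = true_means[:, None] + rng.normal(scale=6.0, size=(n_groups, n_per_group))

ybar = y.mean(axis=1)              # group sample means
sigma2 = 6.0**2 / n_per_group      # sampling variance of each group mean (known here)

# Empirical-Bayes estimates of the top-level parameters (mu, tau^2):
# the group means estimate mu, and their excess variance estimates tau^2.
mu_hat = ybar.mean()
tau2_hat = max(ybar.var(ddof=1) - sigma2, 0.0)

# Posterior (shrinkage) estimate for each group: a precision-weighted
# compromise between its own mean and the overall mean.
weight = tau2_hat / (tau2_hat + sigma2)
shrunk = mu_hat + weight * (ybar - mu_hat)

for g in range(n_groups):
    print(f"group {g}: raw mean {ybar[g]:6.2f} -> shrunken {shrunk[g]:6.2f}")

As the between-group variance comes to dominate the sampling variance, the shrinkage weight moves toward 1 and each group relies mostly on its own data; when groups are noisy relative to their spread, they borrow more heavily from the ensemble, which is the essence of "borrowing strength."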


13.2 The National Halothane Study

I take as a point of departure The National Halothane Study (Bunker et al., 1969). This was an investigation carried out under the auspices of the National Research Council. Unlike most NRC studies, it involved data collection and data analysis based on new methodology. The controversy about the safety of the anesthetic halothane was the result of a series of published cases involving deaths following the use of halothane in surgery. Mosteller offers the following example:

"A healthy young woman accidentally slashed her wrists on a broken windowpane and was rushed to the hospital. Surgery was performed using the anesthetic halothane with results that led everyone to believe that the outcome of the treatment was satisfactory, but a few days later the patient died. The cause was traced to massive hepatic necrosis — so many of her liver cells died that life could not be sustained. Such outcomes are very rare, especially in healthy young people." (Mosteller, 2010, p. 69)

The NRC Halothane Committee collected data from 50,000 hospital records that were arrayed in the form of a very large, sparse multi-way contingency table, for 34 hospitals, 5 anesthetics, 5 years, 2 genders, 5 age groups, 7 risk levels, type of operation, etc., and of course survival. There were 17,000 deaths. A sample of 25 cases per hospital, taken to estimate the denominator, made up the residual 33,000 cases.

When we say the data are sparse we are talking about cells in a contingency table with an average count less than 1! The common wisdom of the day, back in the 1960s, was that to analyze contingency tables, one needed cell counts of 5 or more, and zeros in particular were anathema. You may even have read such advice in recent papers and books.

The many statisticians involved in the halothane study brought a number of standard and new statistical ideas to bear on this problem. One of these was the use of log-linear models, work done largely by Yvonne Bishop, who was a graduate student at the time in the Department of Statistics at Harvard. The primary theory she relied upon was at that time a somewhat obscure paper by an Englishman named Birch (1963), whose theorem on the existence of maximum likelihood estimates assumed that all cell counts are positive. But she needed a way to actually do the computations, at a time when we were still carrying boxes of punched cards to the computer center to run batch programs!

The simple version of the story — see Fienberg (2011) for a more technical account — was that Yvonne used Birch's results (ignoring the condition on positive cell counts) to derive connections between log-linear and logit models, and she computed maximum likelihood estimates (MLEs) using a version of the method of iterative proportional fitting (IPF), developed by Deming and Stephan in 1940 for a different but related problem. She applied this new methodology to the halothane study. Because the tables of interest from this study exceeded the capacity of the largest available computers of the day, she was led to explore ways to simplify the IPF calculations by multiplicative adjustments to the estimates for marginal tables — an idea related to models with direct multiplicative estimates such as conditional independence. The amazing thing was that the ideas actually worked and the results were a crucial part of the 1969 published committee report and formed the heart of her 1967 PhD thesis (Bishop, 1967).

Now let me step back to the summer of 1966, when Mosteller suggested a pair of different research problems to me involving contingency tables, one of which involved Bayesian smoothing — the smoothing problem utilized the idea of hierarchical models and thus linked in ways I didn't understand at the time to the election night forecasting model from NBC. Both of these problems were motivated by the difficulties Mosteller had encountered in his work on the halothane study and they ended up in my 1968 PhD thesis (Fienberg, 1968).

It was only later that I began to think hard about Yvonne's problem of zeros and maximum likelihood estimation. This work involved several of my students (and later colleagues), most notably Shelby Haberman (1974) and Mike Meyer. Today the computer programs for log-linear model methods are rooted in code originally developed by Yvonne and Shelby; as for the theoretical question of when zeros matter and when they do not for log-linear models, it was finally resolved by my former Carnegie Mellon student and now colleague, Alessandro Rinaldo (Rinaldo, 2005; Fienberg and Rinaldo, 2012).

The log-linear model work took on a life of its own, and it led to the book with Yvonne and Paul Holland, "Discrete Multivariate Analysis" (1975). Others refer to this book as "the jolly green giant" because of the color of the original cover, and it included many new applications involving extensions of the contingency table ideas. But, over my career, I have found myself constantly returning to problems that treat this log-linear model work as a starting point. I'd like to mention some of these, but only after introducing some additional chronological touch-points.
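To make the IPF idea concrete, here is a minimal sketch, not Bishop's original code, using an invented 2x3 table. It fits the independence log-linear model to a two-way table by repeatedly rescaling the fitted values to match the observed row and column margins.

import numpy as np

def ipf_independence(table, n_iter=50):
    """Fit the independence model to a two-way table by iterative
    proportional fitting: alternately scale the fitted values to match
    the observed row margins, then the observed column margins."""
    observed = np.asarray(table, dtype=float)
    fitted = np.full_like(observed, observed.sum() / observed.size)
    for _ in range(n_iter):
        fitted *= (observed.sum(axis=1, keepdims=True) /
                   fitted.sum(axis=1, keepdims=True))
        fitted *= (observed.sum(axis=0, keepdims=True) /
                   fitted.sum(axis=0, keepdims=True))
    return fitted

# A small illustrative table; the counts are invented for the example.
obs = np.array([[10, 20, 5],
                [ 3, 15, 7]])
print(np.round(ipf_independence(obs), 2))

# For independence the MLE has a closed form (row total * column total / n),
# and IPF reproduces it; the value of IPF is that the same margin-matching
# cycle also works when no closed form exists.
print(np.round(np.outer(obs.sum(axis=1), obs.sum(axis=0)) / obs.sum(), 2))

For higher-way tables and models without direct estimates, which was the situation in the Halothane analysis, the same cycle is applied in turn to each margin fixed by the model.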


13.3 The President's Commission and CNSTAT

In 1970, when I was a junior faculty member at the University of Chicago, my senior colleague, Bill Kruskal, seemed to be headed to Washington with enormous frequency. After a while, I learned about his service on the President's Commission on Federal Statistics, led by W. Allan Wallis, who was then President of the University of Rochester, and Fred Mosteller. When the Commission reported in 1971, it included many topics that provide a crosswalk between the academic statistics community and statisticians in the federal government.

The release of the two-volume Report of the President's Commission on Federal Statistics was a defining moment not only for the Federal Statistical System, but also for the National Academy of Sciences. It had many recommendations to improve aspects of the Federal statistical system and its co-ordination. One topic explored at length in the report was privacy and confidentiality — I'll return to this shortly. For the moment I want to focus on the emphasis in the report on the need for outside advice and assessment for work going on in the federal government:

Recommendation 5–4: The Commission recommends that a National Academy of Sciences–National Research Council committee be established to provide an outside review of federal statistical activities.

That committee was indeed established a few years later as the Committee on National Statistics (CNSTAT) and it has blossomed to fulfill not only the role envisioned by the Commission members, but also to serve as a repository of statistical knowledge, both about the system and statistical methodology for the NRC more broadly. The agenda was well set by the committee's first chair, William Kruskal, who insisted that its focus be "national statistics" and not simply "federal statistics," implying that its mandate reaches well beyond the usual topics and problems associated with the federal statistics agencies and their data series.

I joined the committee in 1978 and served as Chair from 1981 through 1987. CNSTAT projects over the past 35 years serve as the backdrop for the other topics I plan to cover here.

13.4 Census-taking and multiple-systems estimation

One of the most vexing public controversies that has raged for the better part of the last 40 years has been the accuracy of the decennial census. As early as 1950 the Census Bureau carried out a post-enumeration survey to gauge the accuracy of the count. NRC committees in 1969, and again in 1979, addressed the topic of census accuracy and the possibility that the census counts be adjusted for the differential undercount of Blacks. Following the 1980 census, New York City sued the Bureau, demanding that it use a pair of surveys conducted at census time to carry out an adjustment. The proposed adjustment methodology used a Bayesian version of something known as dual-system estimation, or capture-recapture methodology. For those who have read my 1972 paper (Fienberg, 1972) or Chapter 6 of Bishop, Fienberg and Holland (1975), you will know that one can view this methodology as a variant of a special case of log-linear model methodology.
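As a concrete illustration of the dual-system idea (a sketch only, with invented counts, and not the Bureau's adjustment methodology), two lists of a closed population give counts of people seen on both lists, on only the first, and on only the second; the "missed by both" cell of the 2x2 table is unobserved, and under independence of the lists the log-linear model supplies an estimate of it, and hence of the total.

# Dual-system (capture-recapture) estimation sketch; counts are invented.
n11 = 800   # on both the census list and the survey list
n10 = 150   # on the census list only
n01 = 120   # on the survey list only

# Under independence of the two lists, the missing cell is estimated by
# n00_hat = n10 * n01 / n11, giving the population total estimate below.
n00_hat = n10 * n01 / n11
total_hat = n11 + n10 + n01 + n00_hat

# Equivalent Lincoln-Petersen form: (list 1 total) * (list 2 total) / (both).
N_hat = (n11 + n10) * (n11 + n01) / n11

print(round(n00_hat, 1), round(total_hat, 1), round(N_hat, 1))

With three or more lists the same reasoning extends to log-linear models for an incomplete 2^k table, the multiple-recapture setting of Fienberg (1972).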


In the 1980s and again in the 1990s, I was a member of a CNSTAT panel addressing this and other methodological issues. Several authors during this period wrote about multiple-system estimation in the census context, a topic to which I contributed.

Political pressure and lawsuits have thwarted the use of this methodology as a formal part of the census. Much of this story is chronicled in my 1999 book with Margo Anderson, "Who Counts? The Politics of Census-Taking in Contemporary America" (Anderson and Fienberg, 1999). In conjunction with the 2000 decennial census, the Bureau used related log-linear methodology to produce population estimates from a collection of administrative lists. This work was revived following the 2010 census. In the meantime, I and others have produced several variants on the multiple-recapture methodology to deal with population heterogeneity, and I have a current project funded by the Census Bureau and the National Science Foundation that is looking once again at ways to use these more elaborate approaches involving multiple lists for both enumeration and census accuracy evaluation.

I'm especially proud of the fact that the same tools have now emerged as major methodologies in epidemiology in the 1990s (IWG, 1995a,b) and in human rights over the past decade (Ball and Asher, 2002). This is an amazing and unexpected consequence of work begun in a totally different form as a consequence of the National Halothane Study and the methodology it spawned.

13.5 Cognitive aspects of survey methodology

One of the things one learns about real sample surveys is that the measurement problems, especially those associated with questionnaire design, are immense. I learned this firsthand while working with data from the National Crime Survey and as a technical advisor to the National Commission on Employment and Unemployment Statistics, both in the 1970s. These matters rarely, if ever, show up in the statistics classroom or in textbooks, and they had long been viewed as a matter of art rather than science.

Triggered by a small 1980 workshop on victimization measurement and cognitive psychology, I proposed that the NRC sponsor a workshop on cognitive aspects of survey measurement that would bring together survey specialists, government statisticians, methodologically oriented statisticians, and cognitive scientists. My motivation was simple: the creative use of statistical thinking could suggest new ways of carrying out interviews that ultimately could improve not only specific surveys, but the field as a whole. Under the leadership of Judy Tanur and Miron Straf, the Committee on National Statistics hosted such a workshop in the summer of 1983 and it produced a widely cited volume (Jabine et al., 1984), as well as a wide array of unorthodox ideas that have now been instantiated in three major US statistical agencies and are part of the training of survey statisticians. Census Bureau surveys and census forms are now regularly developed using cognitive principles. Today, few students or practitioners understand the methodological roots of this enterprise. Tanur and I also linked these ideas to our ongoing work on combining experiments and surveys (Fienberg and Tanur, 1989).

13.6 Privacy and confidentiality

I mentioned that the President's Commission explored the topic of privacy and confidentiality in depth in its report. But I failed to tell you that most of the discussion was about legal and other protections for statistical databases. Indeed, as I participated in discussions of large government surveys throughout the 1970s and 1980s, the topic was always present but rarely in the form that you and I would recognize as statistical. That began to change in the mid-1980s with the work of my then colleagues George Duncan and Diane Lambert (1986, 1989) interpreting several rules for the protection of confidentiality in government practice using the formalism of statistical decision theory.

I was finally drawn into the area when I was asked to review the statistics literature on the topic for a conference in Dublin in 1992; see Fienberg (1994). I discovered what I like to refer to as a statistical gold-mine whose veins I have been working for the past 21 years. There has been a major change since the President's Commission report, linked to changes in the world of computing and the growth of the World Wide Web. This has produced new demands for access to statistical data, and new dangers of inappropriate record linkage and statistical disclosures. These are not simply national American issues, but rather they are international ones, and they have stimulated exciting technical statistical research.

The Committee on National Statistics has been actively engaged in this topic with multiple panels and workshops on different aspects of privacy and confidentiality, and there is now substantial technical statistical literature on the topic, and even a new online Journal of Privacy and Confidentiality; see http://repository.cmu.edu/jpc/.

How does this work link to other topics discussed here? Well, much of my research has dealt with the protection of information in large sparse contingency tables, and it will not surprise you to learn that it ties to theory on log-linear models. In fact, there are also deep links to the algebraic geometry literature and the geometry of 2 × 2 contingency tables, one of those problems Fred Mosteller introduced me to in 1966. See Dobra et al. (2009) for some details.


FIGURE 13.1
Sensitivity and false positive rates in 52 laboratory datasets on polygraph validity. Reprinted with permission from The Polygraph and Lie Detection by the National Academy of Sciences (NRC, 2003). Courtesy of the National Academies Press, Washington, DC.

13.7 The accuracy of the polygraph

In 2000, I was asked to chair yet another NRC committee on the accuracy of polygraph evidence in the aftermath of the "Wen-Ho Lee affair" at Los Alamos National Laboratory, and the study was in response to a congressional request. My principal qualification for the job, beyond my broad statistical background and my research and writing on forensic topics, was ignorance — I had never read a study on the polygraph, nor had I been subjected to a polygraph exam. I will share with you just one figure from the committee's report (NRC, 2003).

Figure 13.1 takes the form of a receiver operating characteristic plot (or ROC curve plot) that is just a "scatterplot" showing through connected lines the sensitivity and specificity figures derived from each of the 52 laboratory studies that met the committee's minimal quality criteria. I like to refer to this as our "show me the data" plot. Each study has its own ROC curve, points connected by dotted lines. Each point comes from a 2 × 2 contingency table. You can clearly see why we concluded that the polygraph is better than chance but far from perfect! Further, there are two smooth curves on the graph representing the accuracy scores encompassed by something like the interquartile range of the experimental results. (Since there does not exist a natural definition for quartiles for multivariate data of this nature, the committee first computed the area under the curve for each study, A, rank-ordered the values of A, and then used symmetric ROC curves corresponding to the quartiles for these values. These curves also happen to enclose approximately 50% of the data points.) The committee chose this scatterplot, which includes essentially all of the relevant data on accuracy from the 52 studies, rather than a single summary statistic or one with added standard error bounds, because we judged it important to make the variability in results visible, and because our internal analyses of the characteristics of the studies left us suspicious that the variability was non-random. Polygraph accuracy likely depends on unknown specifics of the test situation, and we did not want to create the impression that there is a single number to describe polygraph accuracy appropriately across situations.

Although I thought this was going to be a one-of-a-kind activity, I guess I should have known better. About a year after the release of the report, I testified at a senate hearing on the Department of Energy's polygraph policy (Fienberg and Stern, 2005), and continue to be called upon by the media to comment on the accuracy of such methods for the detection of deception — as recently as this month, over 10 years after the publication of our report.

13.8 Take-home messages

What are the lessons you might leave this chapter having learned?

(a) First, it's fun to be a statistician, especially because we can ply our science in a diverse set of ways. But perhaps most of you already knew that.

(b) Second, big problems, especially those confronting the nation, almost always have a statistical component and working on these can be rewarding, both personally and professionally. Your work can make a difference.

(c) Third, working on even small aspects of these large national problems can stimulate the creation of new statistical methodology and even theory. Thus if you engage in such activities, you will still have a chance to publish in the best journals of our field.

(d) Fourth, such new methods often have unplanned-for applications in other fields. This will let you, as a statistician, cross substantive boundaries in new and exciting ways.

Who would have thought that working on a few problems coming out of the Halothane Study would lead to a new integrated set of models and methods that would have impact in many fields and in different forms?
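Returning briefly to the ROC summary of Section 13.7: each point in Figure 13.1 is a (false-positive rate, sensitivity) pair computed from a 2 × 2 table, and the committee's A is the area under a study's ROC curve. The sketch below uses invented tables, not the 52 laboratory studies, purely to show the arithmetic.

def roc_point(tp, fn, fp, tn):
    """Sensitivity and false-positive rate from one 2x2 table."""
    sensitivity = tp / (tp + fn)
    false_positive_rate = fp / (fp + tn)
    return false_positive_rate, sensitivity

# Invented 2x2 tables for one hypothetical study, one per decision threshold:
# (true positives, false negatives, false positives, true negatives).
tables = [(40, 10, 20, 30), (30, 20, 10, 40), (20, 30, 5, 45)]

points = sorted(roc_point(*t) for t in tables)
# Anchor the empirical ROC curve at (0, 0) and (1, 1), then compute the
# area A under the piecewise-linear curve by the trapezoid rule.
curve = [(0.0, 0.0)] + points + [(1.0, 1.0)]
A = sum((x2 - x1) * (y1 + y2) / 2
        for (x1, y1), (x2, y2) in zip(curve, curve[1:]))
print(points)
print(round(A, 3))

Rank-ordering the values of A across studies, as the committee did, then gives the quartile curves drawn in Figure 13.1.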


References

Airoldi, E.M., Erosheva, E.A., Fienberg, S.E., Joutard, C., Love, T.M., and Shringarpure, S. (2010). Reconceptualizing the classification of PNAS articles. Proceedings of the National Academy of Sciences, 107:20899–20904.

Anderson, M. and Fienberg, S.E. (1999). Who Counts? The Politics of Census-Taking in Contemporary America. Russell Sage Foundation, New York.

Ball, P. and Asher, J. (2002). Statistics and Slobodan: Using data analysis and statistics in the war crimes trial of former President Milosevic. Chance, 15(4):17–24.

Birch, M.W. (1963). Maximum likelihood in three-way contingency tables. Journal of the Royal Statistical Society, Series B, 25:220–233.

Bishop, Y.M.M. (1967). Multi-Dimensional Contingency Tables: Cell Estimates. PhD thesis, Department of Statistics, Harvard University, Cambridge, MA.

Bishop, Y.M.M., Fienberg, S.E., and Holland, P.W. (1975). Discrete Multivariate Analysis: Theory and Practice. MIT Press, Cambridge, MA.

Bunker, J.P., Forrest, W.H., Mosteller, F., and Vandam, L.D., Eds. (1969). The National Halothane Study. US Government Printing Office, Washington, DC.

Dobra, A., Fienberg, S.E., Rinaldo, A., Slavkovic, A., and Zhou, Y. (2009). Algebraic statistics and contingency table problems: Log-linear models, likelihood estimation, and disclosure limitation. In Emerging Applications of Algebraic Geometry (M. Putinar and S. Sullivant, Eds.). IMA Volumes in Mathematics and its Applications, vol. 148. Springer, New York, pp. 63–88.

Duncan, G.T. and Lambert, D. (1986). Disclosure-limited data dissemination (with discussion). Journal of the American Statistical Association, 81:10–28.

Duncan, G.T. and Lambert, D. (1989). The risk of disclosure for microdata. Journal of Business & Economic Statistics, 7:207–217.

Fienberg, S.E. (1968). The Estimation of Cell Probabilities in Two-Way Contingency Tables. PhD thesis, Department of Statistics, Harvard University, Cambridge, MA.

Fienberg, S.E. (1972). The multiple recapture census for closed populations and incomplete 2^k contingency tables. Biometrika, 59:591–603.

Fienberg, S.E. (1994). Conflicts between the needs for access to statistical information and demands for confidentiality. Journal of Official Statistics, 10:115–132.

Fienberg, S.E. (2007). Memories of election night predictions past: Psephologists and statisticians at work. Chance, 20(4):8–17.

Fienberg, S.E. (2011). The analysis of contingency tables: From chi-squared tests and log-linear models to models of mixed membership. Statistics in Biopharmaceutical Research, 3:173–184.

Fienberg, S.E. and Rinaldo, A. (2012). Maximum likelihood estimation in log-linear models. The Annals of Statistics, 40:996–1023.

Fienberg, S.E. and Stern, P.C. (2005). In search of the magic lasso: The truth about the polygraph. Statistical Science, 20:249–260.

Fienberg, S.E., Stigler, S.M., and Tanur, J.M. (2007). The William Kruskal legacy: 1919–2005 (with discussion). Statistical Science, 22:255–278.

Fienberg, S.E. and Tanur, J.M. (1989). Combining cognitive and statistical approaches to survey design. Science, 243:1017–1022.

Good, I.J. (1965). The Estimation of Probabilities. MIT Press, Cambridge, MA.

Good, I.J. (1980). Some history of the hierarchical Bayesian methodology (with discussion). In Bayesian Statistics: Proceedings of the First International Meeting in Valencia (Spain) (J.M. Bernardo, M.H. DeGroot, D.V. Lindley, and A.F.M. Smith, Eds.), Universidad de Valencia, pp. 489–519.

Haberman, S.J. (1974). Analysis of Frequency Data. University of Chicago Press, Chicago, IL.

International Working Group for Disease Monitoring and Forecasting (1995a). Capture-recapture and multiple-record systems estimation I: History and theoretical development. American Journal of Epidemiology, 142:1047–1058.

International Working Group for Disease Monitoring and Forecasting (1995b). Capture-recapture and multiple-record systems estimation II: Applications in human diseases. American Journal of Epidemiology, 142:1059–1068.

Jabine, T.B., Straf, M.L., Tanur, J.M., and Tourangeau, R., Eds. (1984). Cognitive Aspects of Survey Methodology: Building a Bridge Between Disciplines. National Academies Press, Washington, DC.

Lindley, D.V. and Smith, A.F.M. (1972). Bayes estimates for the linear model (with discussion). Journal of the Royal Statistical Society, Series B, 34:1–41.

Mosteller, F. (2010). The Autobiography of Frederick Mosteller (S.E. Fienberg, D.C. Hoaglin, and J.M. Tanur, Eds.). Springer, New York.

National Research Council Committee to Review the Scientific Evidence on the Polygraph (2003). The Polygraph and Lie Detection. National Academies Press, Washington, DC.

President's Commission on Federal Statistics (1971). Federal Statistics. Volumes I and II. US Government Printing Office, Washington, DC.

Rinaldo, A. (2005). On Maximum Likelihood Estimates for Contingency Tables. PhD thesis, Department of Statistics, Carnegie Mellon University, Pittsburgh, PA.


14

Where are the majors?

Iain M. Johnstone
Department of Statistics
Stanford University, Stanford, CA

We present a puzzle in the form of a plot. No answers are given.

14.1 The puzzle

Figure 14.1 suggests that the field of statistics in the US is spectacularly and uniquely unsuccessful in producing Bachelor's degree graduates when compared with every other undergraduate major for which cognate Advanced Placement (AP) courses exist. In those same subject areas, statistics also produces by far the fewest Bachelor's degrees when the normalization is taken as the number of doctoral degrees in that field. We are in an era in which demand for statistical/data scientists and analytics professionals is exploding. The puzzle is to decide whether this plot is a spur to action or merely an irrelevant curiosity.

14.2 The data

The "Subjects" are, with some grouping, those offered in AP courses by the College Board in 2006. The number of students taking these courses varies widely, from over 500,000 in English to under 2,000 in Italian. In the 2006 data, the statistics number was about 88,000 (and reached 154,000 in 2012).

The National Center for Education Statistics (NCES) publishes data on the number of Bachelor's, Master's and doctoral degrees conferred by degree-granting institutions.

Bachelor-to-AP ratio is, for each of the subjects, the ratio of the number of Bachelor's degrees in 2009–10 to the number of AP takers in 2006. One perhaps should not think of it as a "yield" of Bachelor's degrees from AP takers, but rather as a measure of the number of Bachelor's degrees normalized by Subject size.

Bachelor-to-PhD ratio, for each of the subjects, is the ratio of the number of Bachelor's degrees in 2009–10 to the number of doctoral degrees. Again it is not a measure of the "yield" of doctoral degrees from Bachelor's degrees.

FIGURE 14.1
Normalized Bachelor's degree productivity by discipline. [Scatterplot, on log scales, of the Bachelor-to-PhD ratio (vertical axis) against the Bachelor-to-AP ratio (horizontal axis), with each AP subject shown as a labeled point.]

The data (and sources) are in file degree.xls. This and other files, including R code, are in directory Majors on the page that accompanies this volume: http://www.crcpress.com/product/isbn/9781482204964.

14.3 Some remarks

1. It is likely that the number of Bachelor's degrees in statistics is undercounted, as many students may be getting the equivalent of a statistics major in a Mathematics Department. The undercount would have to be by a factor of five to ten to put statistics into the bulk of the plot.

The NCES data shows almost three times as many Master's degrees in Statistics for 2009–10 as Bachelor's degrees (1602 versus 593). This is significant, and may in part reflect the undercount just cited. However, a focus on the Bachelor's level still seems important, as this is seen as the flagship degree at many institutions.

2. There have always been many routes to a career in statistics. For a mathematically inclined student, it often used to be said that an undergraduate focus on statistics was too narrow, and that a broad training in mathematics was a better beginning. My own department did not offer an undergraduate degree in statistics for this reason.

In 2013, it seems that young people who like data and mathematics should study at least computing, mathematics, and statistics (the order is alphabetical) and that Statistics Departments might design and promote majors that encourage that breadth.

3. An Academic Dean spends a lot of time reading letters of evaluation of teaching, research, etc. In my own experience, the most consistently outstanding student letters about teaching were in philosophy. Why? Here is a conjecture rather than an explanation: in the US, philosophy is not a high school subject — and there is no AP exam — and so its professors have to "start from scratch" in winning converts to the philosophy major. They appear to do this by dint of exemplary teaching.

What is the relevance of this to statistics? I'm not sure, but statistics likewise has not traditionally been a high school subject (this is changing, as the advent of AP statistics shows), and it seems that statisticians in research universities have not in the past felt the same sense of mission about recruiting to an undergraduate major in statistics.

4. These are data aggregated at a national level, and — perhaps — point to a national phenomenon. The good news is that there are outstanding examples of individual departments promoting and growing statistics majors.

Acknowledgements

Thanks to Robert Gould, Nicholas Horton, and Steve Pierson for comments on an earlier draft. After much procrastination, this short piece was written while not preparing a conference talk, in conformance with the Ig Nobel Prize winning article by Perry (1996). The title echoes that of the sometimes forgotten classic Schwed Jr. (1940), without any other claimed similarity.
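R code for the actual figure accompanies the volume in the Majors directory mentioned above; the sketch below is only a schematic alternative showing the two ratios being computed and plotted on log scales. The column names are assumptions made for illustration, and the real layout of degree.xls may differ.

import pandas as pd
import matplotlib.pyplot as plt

# Assumed columns: Subject, AP_takers_2006, Bachelors_2009_10, Doctorates_2009_10.
df = pd.read_excel("degree.xls")
df["BachelorToAP"] = df["Bachelors_2009_10"] / df["AP_takers_2006"]
df["BachelorToPhD"] = df["Bachelors_2009_10"] / df["Doctorates_2009_10"]

fig, ax = plt.subplots()
ax.scatter(df["BachelorToAP"], df["BachelorToPhD"])
for _, row in df.iterrows():
    ax.annotate(row["Subject"], (row["BachelorToAP"], row["BachelorToPhD"]))
ax.set_xscale("log")
ax.set_yscale("log")
ax.set_xlabel("Bachelor-to-AP ratio")
ax.set_ylabel("Bachelor-to-PhD ratio")
plt.show()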


References

Perry, J. (1996). How to procrastinate and still get things done. Chronicle of Higher Education. http://chronicle.com/article/How-to-ProcrastinateStill/93959

Schwed, F. Jr. (1940). Where Are the Customers' Yachts? Or a Good Hard Look at Wall Street. Simon and Schuster, Toronto, Canada.


15

We live in exciting times

Peter G. Hall
Department of Mathematics and Statistics
University of Melbourne, Australia
and
Department of Statistics
University of California, Davis, CA

15.1 Introduction

15.1.1 The beginnings of computer-intensive statistics

My involvement in statistics research started at about the time that significant interactive computing power began to become available in university statistics departments. Up until that point, those of us using powerful electronic computers in universities generally were at the mercy of punch cards operating mainframe computers, typically at relatively distant locations. This severely hindered the use of computers for assessing the performance of statistical methodology, particularly for developing new approaches. However, once computational experiments could be performed from one's desk, and parameter settings adjusted as the results came in, vast new horizons opened up for methodological development.

The new statistical approaches to which this led were able, by virtue of powerful statistical computing, to do relatively complex things to data. For many of us, Cox's regression model (Cox, 1972), and Efron's bootstrap (Efron, 1979a), became feasible only in the late 1970s and early 1980s. Efron (1979b) gave a remarkably prescient account of the future relationship between theory and computation in modern statistics, noting that:

"The need for a more flexible, realistic, and dependable statistical theory is pressing, given the mountains of data now being amassed. The prospect for success is bright, but I believe the solution is likely to lie along the lines suggested in the previous sections — a blend of traditional mathematical thinking combined with the numerical and organizational aptitude of the computer."


Critically, Efron saw theory and computing working together to ensure the development of future statistical methodology, meeting many different demands. And, of course, that is what happened, despite the concerns of some (see, e.g., Hall, 2003, p. 165) that advances in computing threatened to replace theoretical statistical argument.

15.1.2 Computer-intensive statistics and me

In this short paper I shall give a personal account of some of my experiences at this very exciting time. My intention is to focus particularly on a 15-year period, from the late 1970s to the early 1990s. However, it will be necessary from time to time to stray to the present, in order to make a didactic point, and to look back at the past, so as to see how much we still have in common with our forebears like R.A. Fisher and E.J.G. Pitman.

I feel particularly fortunate to have been able to work on the development of statistical methodology during such an era of seminal change. I'll explain how, as in many of the most important things in life, I came to this role largely by accident, while looking for a steady job in probability rather than statistics; and how the many advances in computing that have taken place since then have actually created an increasingly high demand for theoretical analysis, when some of my colleagues had predicted there would actually be much less.

I'm going to begin, in Section 15.2, by trying to capture some of the recent concerns I have heard about the directions being taken by statistics today, and by indicating why I generally do not share the anxiety. To some extent, I feel, the concerns exist only if we try to resist change that we should accept as exciting and stimulating, rather than as a threat.

Also in Section 15.2 I'll try to set this disquiet against the background of the many changes that stimulated my work, particularly from the late 1970s to the early 1990s, and argue that the changes we are seeing today are in part a continuation of the many technological advances that have taken place over a long period. The changes add to, rather than subtract from, our field.

In Section 15.3, with the reader's indulgence I'll focus more sharply on my own experience — on how I came to be a theoretical statistician, and how theory has always guided my intuition and led, somewhat indirectly, to my work on computer-intensive statistical methods. I know that, for many of my colleagues in Australia and elsewhere, my "cart" of theoretical analysis comes before their "horse" of statistical computing. The opportunity to write this short chapter gives me a chance of explaining to them what I've been doing all these years, and why.


15.2 Living with change

15.2.1 The diversification of statistics

Just a few years ago I had a conversation with a colleague who expressed grave concern for the future of statistics. He saw it being taken over by, or subsumed into, fields as diverse and disparate as computer science and bioinformatics, to name only two. He was worried, and wondered what we could do to stop the trend.

Similar disquiet has been articulated by others, and not just recently. The eminent British statistician D.J. Finney expressed apprehensions similar to those of my colleague, although at a time when my colleague had not detected much that was perceptibly wrong. Writing in the newsletter of the Royal Statistical Society (RSS News) in 2000, Professor Finney argued that:

"... professional statisticians may be losing control of — perhaps even losing concern for — what is done in the name of our discipline. I began [this article] by asking 'Whither... [statistics]?' My answer is 'Downhill!' Any road back will be long and tortuous, but unless we find it we fail to keep faith with the lead that giants gave us 75 years ago."

I'm not sure whether it is pragmatism or optimism that keeps me from worrying about these issues — pragmatism because, even if these portents of calamity were well founded, there would not be much that we could do about them, short of pretending we had the powers of Cnut the Great and commanding the tide of statistical change to halt; or optimism, on the grounds that these changes are actually healthy, and more likely to enrich statistics than destroy it.

In 1986 the UCLA historian Theodore Porter wrote that:

"Statistics has become known in the twentieth century as the mathematical tool for analysing experimental and observational data. Enshrined by public policy as the only reliable basis for judgements such as the efficacy of medical procedures or the safety of chemicals, and adopted by business for such uses as industrial quality control, it is evidently among the products of science whose influence on public and private life has been most pervasive."

The years since then have only deepened the involvement of statistics in science, technology, social science and culture, so that Porter's comments about the 20th Century apply with even greater force in the early 21st. Hal Varian's famous remark in The McKinsey Quarterly, in January 2009, that "the sexy job in the next ten years will be statisticians," augments and reinforces Porter's words of almost 30 years ago. Statistics continues to be vibrant and vital, I think because of, not despite, being in a constant state of change.


In general I find that the changes lamented by Finney, and applauded by Varian, invigorate and energize our field. I'm not troubled by them, except for the fact that they can make it more challenging to get funding for positions in statistics, and more generally for research and teaching in the field. Indeed, it is not just in Australia, but across the globe, that the funding pool that is notionally allocated to statistics is being challenged by many different multi-disciplinary pressures, to such an extent that financial support for core research and teaching in statistics has declined in many cases, at least relative to the increasing size of our community.

Funding is flowing increasingly to collaborative research-centre type activities, where mathematical scientists (including statisticians) are often not involved directly at all. If involved, they are often present as consultants, rather than as true collaborators sharing in the funding. This is the main danger that I see, for statisticians, in the diversification of statistics.

15.2.2 Global and local revolutions

The diversification has gone hand in hand with a revolution, or rather several revolutions, that have resulted in the past from the rapid development of inexpensive computing power since the 1970s, and from technologies that have led to an explosion in machine recorded data. Indeed, we might reasonably think that massive technological change has altered our discipline mainly through the ways our data are generated and the methods we can now use to analyse them.

However, while those changes are very important, they are perhaps minor when set against the new questions that new sorts of data, and new computational tools, enable scientists to ask, and the still newer data types, and data analyses, that they must address in order to respond to those questions. Statisticians, and statistical methods, are at the heart of exploring these new puzzles and clarifying the new directions in which they point.

Some aspects of the revolutions are "global," in the sense that, although the motivation might come from a particular area of application, the resulting changes influence many fields. Others are more local; their impact is not so widespread, and sometimes does not stray terribly far from the area of application that first motivated the new developments.

During the last 30 years or so we have seen examples of both global and local revolutions in statistical methodology. For example, Efron's (1979a) bootstrap was definitely global. Although it arguably had its origins in methodology for sample survey data, for example in work of Hubback (1946), Mahalanobis (1946), Kish (1957), Guerney (1963), and McCarthy (1966, 1969), it has arguably touched all fields of statistics. [Hall (2003) gave a brief account of the prehistory of the bootstrap.] Statistical "revolutions" that are more local, in terms of influence, include work during the 1980s on image analysis, and much of today's statistical research on very high-dimensional data analysis.


Some of this work is having as much influence on statistics itself as on the fields of science and technology that motivated it, and some of that influence is particularly significant. For example, we have benefited, and are still benefiting, from appreciating that entropy is a metric that can be used to assess the performance of statistical smoothing in general, non-imaging contexts (Donoho et al., 1992). And we are using linear models to select variables in relatively low-dimensional settings, not just for high-dimensional data. (The linear model is, after all, a wonderful parametric surrogate for monotonicity in general contexts.) Some of these new methodologies have broken free from statistical gravity and risen above the turmoil of other developments with which they are associated. They include, in the setting of modern high-dimensional data analysis, the lasso and basis pursuit; see, e.g., Tibshirani (1996, 2014a,b) and Chen et al. (1998).

15.3 Living the revolution

15.3.1 A few words to the reader

A great deal of Section 15.3, particularly Section 15.3.2, is going to be about me, and for that I must apologize. Please feel free to skip Section 15.3.2, and come back in at Section 15.3.3.

15.3.2 A little background

Einstein's (1934) advocacy of mathematics as a pillar for creative reasoning in science, or at least in physics, has been cited widely:

"... experience of course remains the sole criterion of the serviceability of a mathematical construction for physics, but the truly creative principle resides in mathematics."

I'm not going to suggest for a moment that all statisticians would agree that Einstein's argument is valid in their field. In fact, I know some who regard theory as a downright hindrance to the way they do research, although it works for me. However, let's begin at the beginning.

I spent my high school days happily, at a very good government school in Sydney. However, the school's science curriculum in the 1960s effectively permitted no experimental work for students. This saved money, and since there were no state-wide exams in practical work then the effects of that privation were not immediately apparent. It meant that my study of science was almost entirely theoretical, and that suited me just fine. But it didn't prepare me well for my first year at university, where in each of biology, chemistry and physics I had several hours of practical work each week.


I entered the University of Sydney as an undergraduate in early 1970, studying for a science degree and convinced I would become a theoretical physicist, but I hadn't bargained for all the lab work I would have to do before I reached that goal. Moreover, the mathematics in my physics classes was not rigorous enough for me. Asymptotic mathematical formulae abounded, but they were never proved "properly." I quickly came to the conclusion that it would be better for all concerned, including those around me (I once flooded the physics lab), if I devoted my energies to mathematics.

However, for my second year I needed three subjects. I was delighted to be able to study pure mathematics and applied mathematics in separate programs, not just one as had been the case in first year. But I had to find a third subject, one that didn't require any lab work. I saw in the student handbook that there was a subject called mathematical statistics, which in those days, at the University of Sydney, started as a major in second year — so I hadn't missed anything in my first year. And it had "mathematical" in its title, so I felt it would probably suit.

Indeed it did, particularly the course on probability from Feller's wonderful Volume I (Feller, 1968). In second and third years the mathematical statistics course was streamed into "pass" and "honors" levels, and for the latter I had to take extra lectures, which were invariably at a high mathematical level and which I found fascinating. I even enjoyed the classes that led to Ramsey's theorem, although I could not any longer reproduce the proof!

I took a course on measure theory in the third year pure mathematics curriculum, and it prepared me very well for a fourth year undergraduate mathematical statistics course in probability theory, based on Chung's graduate text (Chung, 1968). That course ran for a full year, three lectures a week, and I loved both it and Chung's book.

I appreciate that the book is not for everyone. Indeed, more than a few graduate students have confided to me how difficult they have found it to get into that text. But for me it had just the right mix of intuition, explanation, and leave-it-up-to-the-reader omissions to keep my attention. Chung's style captivated me, and I'm pleased to see that I'm still not alone. (I've just read the five reviews of the third edition on Amazon, and I'm delighted to say that each of them gives the book five stars.)

Another attraction of the course was that I could give every third lecture myself. It turned out I was the only student in the course, although a very amiable logician, Gordon Monro from the Department of Pure Mathematics, also attended. The two of us, and our assigned lecturer Malcolm Quine, shared the teaching among us. It was a wonderful experience. It gave me a lifelong love of probability, and also of much of the theory that underpins statistics.

Now let's fast forward to late 1976, when I returned to Australia from the UK after completing my doctorate in probability. I had a short-term contract job at the University of Melbourne, with no opportunity in 1976 for anything more permanent. In particular, I was having significant difficulty finding a longer-term position in probability. So I applied for any job that had some connection to probability or theoretical statistics, including one at the Australian National University that was advertised with a preference for a biometrician.

This was in 1977, and I assume that they had no plausible applicants in biometrics, since they offered the position to me. However, the Head of Department, Chip Heathcote, asked me (quite reasonably) to try to migrate my research interests from probability to statistics. (Biometrics wasn't required.) He was very accommodating and nice about it, and in particular there was no deadline for making the switch.

I accepted the position and undertook to make the change, which I found that I quite enjoyed. On my reckoning, it took me about a decade to move from probability to statistics, although some of my colleagues, who perhaps wouldn't appreciate that Einstein's remark above, about physics, might apply to statistics as well, would argue that I have still got a way to go. I eased myself into statistics by taking what I might call the "contemporary nonparametric" route, which I unashamedly admit was much easier than proceeding along a parametric path.

At least in the 1970s and 1980s, much of nonparametric statistics (function estimation, the jackknife, the bootstrap, etc.) was gloriously ad hoc. The basic methodology, problems and concepts (kernel methods, bias and variance estimation, resampling, and so forth) were founded on the fact that they made good intuitive sense and could be justified theoretically, for example in terms of rates of convergence. To undertake this sort of research it was not necessary to have at your fingertips an extensive appreciation of classical statistical foundations, based for example on sufficiency and efficiency and ancillarity and completeness and minimum variance unbiasedness. Taking a nonparametric route, I could start work immediately. And it was lots of fun.

I should mention, for the benefit of any North American readers who have got this far, that in Australia at that time there was virtually no barrier between statistics and probability. Practitioners of both were in the same department, typically a Department of Statistics, and a Mathematics Department was usually devoid of probabilists (unless the department also housed statisticians). This was one of many structures that Australian universities inherited from the United Kingdom, and I have always found it to be an attractive, healthy arrangement. However, given my background you'd probably expect me to have this view.

The arrangement persists to a large extent today, not least because many Australian Statistics Departments amalgamated with Mathematics Departments during the budget crises that hit universities in the mid to late 1990s. A modern exception, more common today than thirty years ago, is that strong statistics groups exist in some economics or business areas in Australian universities, where they have little contact with probability.


164 Exciting times15.3.3 Joining the revolutionSo, this is how I came to statistics — by necessity, with employment in mindand having a distinct, persistently theoretical outlook. A paper on nonparametricdensity estimation by Eve Bofinger (1975), whom I’d met in 1974 whileI was an MSc student, drew a connection for me between the theory of orderstatistics and nonparametric function estimation, and gave me a start therein the late 1970s.I already had a strong interest in rates of convergence in the central limittheorem, and in distribution approximations. That gave me a way, in the1980s, of accessing theory for the bootstrap, which I found absolutely fascinating.All these methodologies — function estimation, particularly techniques forchoosing tuning parameters empirically, and of course the bootstrap — werepart of the “contemporary nonparametric” revolution in the 1970s and 1980s.It took off when it became possible to do the computation. I was excited tobe part of it, even if mainly on the theoretical side. In the early 1980s a seniorcolleague, convinced that in the future statistical science would be developedthrough computer experimentation, and that the days of theoretical work instatistics were numbered, advised me to discontinue my interests in theoryand focus instead on simulation. However, stubborn as usual, I ignored him.It is curious that the mathematical tools used to develop statistical theorywere regarded firmly as parts of probability theory in the 1970s, 80s and eventhe 90s, whereas today they are seen as statistical. For example, recent experienceserving on an IMS committee has taught me that methods built aroundempirical processes, which were at the heart of wondrous results in probabilityin the 1970s (see, e.g., Komlós et al., 1975, 1976), are today seen by more thanafewprobabilistsasdistinctlystatisticalcontributions.Convergenceratesinthe central limit theorem are viewed in the same light. Indeed, most resultsassociated with the central limit theorem seem today to be seen as statistical,rather than probabilistic. (So, I could have moved from probability tostatistics simply by standing still, while time washed over me!)This change of viewpoint parallels the “reinterpretation” of theory forspecial functions, which in the era of Whittaker and Watson (1902), and indeedalso of the more widely used fourth edition in 1927, was seen as theoreticalmathematics, but which, by the advent of Abramowitz and Stegun (1964)(planning for that volume had commenced as early as 1954), had migrated tothe realm of applied mathematics.Throughout all this work in nonparametric statistics, theoretical developmentwas my guide. Using it hand in hand with intuition I was able to gomuch further than I could have managed otherwise. This has always been myapproach — use theory to augment intuition, and allow them to work togetherto elucidate methodology.Function estimation in the 1970s and 1980s had, to a theoretician likemyself, a fascinating character. Today we can hardly conceive of constructing


P.G. Hall 165a nonparametric function estimator without also estimating an appropriatesmoothing parameter, for example a bandwidth, from the data. But in the1970s, and indeed for part of the 80s, that was challenging to do withoutusing a mainframe computer in another building and waiting until the nextday to see the results. So theory played a critical role.For example, Mike Woodroofe’s paper (Woodroofe, 1970) on asymptoticproperties of an early plug-in rule for bandwidth choice was seminal, and wasrepresentative of related theoretical contributions over the next decade or so.Methods for smoothing parameter choice for density estimation, using crossvalidationand suggested by Habemma et al. (1974) in a Kullback–Leiblersetting, and by Rudemo (1982) and Bowman (1984) for least squares, werechallenging to implement numerically at the time they were introduced, especiallyin Monte Carlo analysis. However, they were explored enthusiasticallyand in detail using theoretical arguments; see, e.g., Hall (1983, 1987) andStone (1984).Indeed, when applied to a sample of size n, cross-validation requires O(n 2 )computations, and even for moderate sample sizes that could be difficult in asimulation study. We avoided using the Gaussian kernel because of the sheercomputational labour required to compute an exponential. Kernels based ontruncated polynomials, for example the Bartlett–Epanechnikov kernel and thebiweight, were therefore popular.In important respects the development of bootstrap methods was no different.For example, the double bootstrap was out of reach, computationally,for most of us when it was first discussed (Hall, 1986; Beran, 1987, 1988).Hall (1986, p. 1439) remarked of the iterated bootstrap that “it could not beregarded as a general practical tool.” Likewise, the computational challengesposed by even single bootstrap methods motivated a variety of techniquesthat aimed to provide greater efficiency to the operation of sampling froma sample, and appeared in print from the mid 1980s until at least the early1990s. However, efficient methods for bootstrap simulation are seldom usedtoday, so plentiful is the computing power that we have at our disposal.Thus, for the bootstrap, as for problems in function estimation, theoryplayed a role that computation really couldn’t. Asymptotic arguments pointedauthoritatively to the advantages of some bootstrap techniques, and to thedrawbacks associated with others, at a time when reliable numerical corroborationwas hard to come by. The literature of the day contains muted versionsof some of the exciting discussions that took place in the mid to late 1980son this topic. It was an extraordinary time — I feel so fortunate to have beenworking on these problems.I should make the perhaps obvious remark that, even if it had been possibleto address these issues in 1985 using today’s computing resources, theorystill would have provided a substantial and unique degree of authority to thedevelopment of nonparametric methods. In one sweep it enabled us to addressissues in depth in an extraordinarily broad range of settings. It allowed us todiagnose and profoundly understand many complex problems, such as the high


In important respects the development of bootstrap methods was no different. For example, the double bootstrap was out of reach, computationally, for most of us when it was first discussed (Hall, 1986; Beran, 1987, 1988). Hall (1986, p. 1439) remarked of the iterated bootstrap that "it could not be regarded as a general practical tool." Likewise, the computational challenges posed by even single bootstrap methods motivated a variety of techniques that aimed to provide greater efficiency to the operation of sampling from a sample, and appeared in print from the mid 1980s until at least the early 1990s. However, efficient methods for bootstrap simulation are seldom used today, so plentiful is the computing power that we have at our disposal.

Thus, for the bootstrap, as for problems in function estimation, theory played a role that computation really couldn't. Asymptotic arguments pointed authoritatively to the advantages of some bootstrap techniques, and to the drawbacks associated with others, at a time when reliable numerical corroboration was hard to come by. The literature of the day contains muted versions of some of the exciting discussions that took place in the mid to late 1980s on this topic. It was an extraordinary time — I feel so fortunate to have been working on these problems.

I should make the perhaps obvious remark that, even if it had been possible to address these issues in 1985 using today's computing resources, theory still would have provided a substantial and unique degree of authority to the development of nonparametric methods. In one sweep it enabled us to address issues in depth in an extraordinarily broad range of settings. It allowed us to diagnose and profoundly understand many complex problems, such as the high variability of a particular method for bandwidth choice, or the poor coverage properties of a certain type of bootstrap confidence interval. I find it hard to believe that numerical methods, on their own, will ever have the capacity to deliver the level of intuition and insightful analysis, with such breadth and clarity, that theory can provide.

Thus, in those early developments of methodology for function estimation and bootstrap methods, theory was providing unrivaled insights into new methodology, as well as being ahead of the game of numerical practice. Methods were suggested that were computationally impractical (e.g., techniques for bandwidth choice in the 1970s, and iterated bootstrap methods in the 1980s), but they were explored because they were intrinsically attractive from an intuitive viewpoint. Many researchers had at least an impression that the methods would become feasible as the cost of computing power decreased, but there was never a guarantee that they would become as easy to use as they are today. Moreover, it was not with a view to today's abundant computing resources that those computer-intensive methods were proposed and developed. To some extent their development was unashamedly an intellectual exercise, motivated by a desire to push the limits of what might sometime be feasible.

In adopting this outlook we were pursuing a strong precedent. For example, Pitman (1937a,b, 1938), following the lead of Fisher (1935, p. 50), suggested general permutation test methods in statistics, well in advance of computing technology that would subsequently make permutation tests widely applicable. However, today the notion that we might discuss and develop computer-intensive statistical methodology, well ahead of the practical tools for implementing it, is often frowned upon. It strays too far, some colleagues argue, from the practical motivation that should underpin all our work.

I'm afraid I don't agree, and I think that some of the giants of the past, perhaps even Fisher, would concur. The nature of revolutions, be they in statistics or somewhere else, is to go beyond what is feasible today, and devise something remarkable for tomorrow. Those of us who have participated in some of the statistics revolutions in the past feel privileged to have been permitted free rein for our imaginations.

Acknowledgements

I'm grateful to Rudy Beran, Anirban Dasgupta, Aurore Delaigle, and Jane-Ling Wang for helpful comments.


References

Abramowitz, M. and Stegun, I.A., Eds. (1964). Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables. National Bureau of Standards, Washington, DC.

Beran, R. (1987). Prepivoting to reduce level error in confidence sets. Biometrika, 74:457–468.

Beran, R. (1988). Prepivoting test statistics: A bootstrap view of asymptotic refinements. Journal of the American Statistical Association, 83:687–697.

Bofinger, E. (1975). Estimation of a density function using order statistics. Australian Journal of Statistics, 17:1–7.

Bowman, A.W. (1984). An alternative method of cross-validation for the smoothing of density estimates. Biometrika, 71:353–360.

Chen, S., Donoho, D., and Saunders, M. (1998). Atomic decomposition by basis pursuit. SIAM Journal of Scientific Computing, 20:33–61.

Chung, K.L. (1968). A Course in Probability Theory. Harcourt, Brace & World, New York.

Cox, D.R. (1972). Regression models and life-tables (with discussion). Journal of the Royal Statistical Society, Series B, 34:187–220.

Donoho, D., Johnstone, I.M., Hoch, J.C., and Stern, A.S. (1992). Maximum entropy and the nearly black object (with discussion). Journal of the Royal Statistical Society, Series B, 54:41–81.

Efron, B. (1979a). Bootstrap methods: Another look at the jackknife. The Annals of Statistics, 7:1–26.

Efron, B. (1979b). Computers and the theory of statistics: Thinking the unthinkable. SIAM Review, 21:460–480.

Einstein, A. (1934). On the method of theoretical physics. Philosophy of Science, 1:163–169.

Feller, W. (1968). An Introduction to Probability Theory and its Applications, Vol. I, 3rd edition. Wiley, New York.

Fisher, R.A. (1935). The Design of Experiments. Oliver & Boyd, Edinburgh.

Guerney, M. (1963). The Variance of the Replication Method for Estimating Variances for the CPS Sample Design. Unpublished Memorandum, US Bureau of the Census, Washington, DC.

Habemma, J.D.F., Hermans, J., and van den Broek, K. (1974). A stepwise discriminant analysis program using density estimation. In Proceedings in Computational Statistics, COMPSTAT 1974. Physica Verlag, Heidelberg, pp. 101–110.

Hall, P. (1983). Large sample optimality of least squares cross-validation in density estimation. The Annals of Statistics, 11:1156–1174.

Hall, P. (1986). On the bootstrap and confidence intervals. The Annals of Statistics, 14:1431–1452.

Hall, P. (1987). On Kullback–Leibler loss and density estimation. The Annals of Statistics, 15:1491–1519.

Hall, P. (2003). A short prehistory of the bootstrap. Statistical Science, 18:158–167.

Hubback, J. (1946). Sampling for rice yield in Bihar and Orissa. Sankhyā, 7:281–294. (First published in 1927 as Bulletin 166, Imperial Agricultural Research Institute, Pusa, India.)

Kish, L. (1957). Confidence intervals for clustered samples. American Sociological Review, 22:154–165.

Komlós, J., Major, P., and Tusnády, G. (1975). An approximation of partial sums of independent RV's and the sample DF. I. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 32:111–131.

Komlós, J., Major, P., and Tusnády, G. (1976). An approximation of partial sums of independent RV's and the sample DF. II. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 34:33–58.

Mahalanobis, P. (1946). Recent experiments in statistical sampling in the Indian Statistical Institute (with discussion). Journal of the Royal Statistical Society, 109:325–378.

McCarthy, P. (1966). Replication: An Approach to the Analysis of Data From Complex Surveys. Vital Health Statistics, Public Health Service Publication 1000, Series 2, No. 14, National Center for Health Statistics, Public Health Service, US Government Printing Office.

McCarthy, P. (1969). Pseudo-replication: Half samples. Review of the International Statistical Institute, 37:239–264.

Pitman, E.J.G. (1937a). Significance tests which may be applied to samples from any population. Royal Statistical Society Supplement, 4:119–130.

Pitman, E.J.G. (1937b). Significance tests which may be applied to samples from any population, II. Royal Statistical Society Supplement, 4:225–232.

Pitman, E.J.G. (1938). Significance tests which may be applied to samples from any population. Part III. The analysis of variance test. Biometrika, 29:322–335.

Rudemo, M. (1982). Empirical choice of histograms and kernel density estimators. Scandinavian Journal of Statistics, 9:65–78.

Stone, C.J. (1984). An asymptotically optimal window selection rule for kernel density estimates. The Annals of Statistics, 12:1285–1297.

Tibshirani, R.J. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58:267–288.

Tibshirani, R.J. (2014a). In praise of sparsity and convexity. In Past, Present, and Future of Statistical Science (X. Lin, C. Genest, D.L. Banks, G. Molenberghs, D.W. Scott, and J.-L. Wang, Eds.). Chapman & Hall, London, pp. 497–505.

Tibshirani, R.J. (2014b). Lasso and sparsity in statistics. In Statistics in Action: A Canadian Outlook (J.F. Lawless, Ed.). Chapman & Hall, London, pp. 79–91.

Whittaker, E.T. and Watson, G.N. (1902). A Course of Modern Analysis. Cambridge University Press, Cambridge, UK.

Woodroofe, M. (1970). On choosing a delta-sequence. The Annals of Mathematical Statistics, 41:1665–1671.


16
The bright future of applied statistics

Rafael A. Irizarry
Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute
and
Department of Biostatistics, Harvard School of Public Health, Boston, MA

16.1 Introduction

When I was asked to contribute to this book, titled Past, Present, and Future of Statistical Science, I contemplated my career while deciding what to write about. One aspect that stood out was how much I benefited from the right circumstances. I came to one clear conclusion: it is a great time to be an applied statistician. I decided to describe the aspects of my career that I have thoroughly enjoyed in the past and present and explain why this has led me to believe that the future is bright for applied statisticians.

16.2 Becoming an applied statistician

I became an applied statistician while working with David Brillinger on my PhD thesis. When searching for an advisor I visited several professors and asked them about their interests. David asked me what I liked and all I came up with was "I don't know. Music?" to which he responded, "That's what we will work on." Apart from the necessary theorems to get a PhD from the Statistics Department at Berkeley, my thesis summarized my collaborative work with researchers at the Center for New Music and Audio Technology. The work involved separating and parameterizing the harmonic and non-harmonic components of musical sound signals (Irizarry, 2001). The sounds had been digitized into data. The work was indeed fun, but I also had my first glimpse into the incredible potential of statistics in a world becoming more and more data-driven.

Despite having expertise only in music, and a thesis that required a CD player to hear the data, fitted models and residuals (http://www.biostat.jhsph.edu/~ririzarr/Demo/index.html), I was hired by the Department of Biostatistics at Johns Hopkins School of Public Health. Later I realized what was probably obvious to the school's leadership: that regardless of the subject matter of my thesis, my time series expertise could be applied to several public health applications (Crone et al., 2001; DiPietro et al., 2001; Irizarry, 2001). The public health and biomedical challenges surrounding me were simply too hard to resist and my new department knew this. It was inevitable that I would quickly turn into an applied biostatistician.

16.3 Genomics and the measurement revolution

Since the day that I arrived at Johns Hopkins University 15 years ago, Scott Zeger, the department chair, fostered and encouraged faculty to leverage their statistical expertise to make a difference and to have an immediate impact in science. At that time, we were in the midst of a measurement revolution that was transforming several scientific fields into data-driven ones. Being located in a School of Public Health and next to a medical school, we were surrounded by collaborators working in such fields. These included environmental science, neuroscience, cancer biology, genetics, and molecular biology. Much of my work was motivated by collaborations with biologists that, for the first time, were collecting large amounts of data. Biology was changing from a data-poor discipline to a data-intensive one.

A specific example came from the measurement of gene expression. Gene expression is the process in which DNA, the blueprint for life, is copied into RNA, the templates for the synthesis of proteins, the building blocks for life. Before microarrays were invented in the 1990s, the analysis of gene expression data amounted to spotting black dots on a piece of paper (Figure 16.1, left). With microarrays, this suddenly changed to sifting through tens of thousands of numbers (Figure 16.1, right). Biologists went from using their eyes to categorize results to having thousands (and now millions) of measurements per sample to analyze. Furthermore, unlike genomic DNA, which is static, gene expression is a dynamic quantity: different tissues express different genes at different levels and at different times. The complexity was exacerbated by unpolished technologies that made measurements much noisier than anticipated. This complexity and level of variability made statistical thinking an important aspect of the analysis. The biologists that used to say, "If I need statistics, the experiment went wrong" were now seeking out our help.


The results of these collaborations have led to, among other things, the development of breast cancer recurrence gene expression assays, making it possible to identify patients at risk of distant recurrence following surgery (van't Veer, 2002).

FIGURE 16.1: Illustration of gene expression data before and after microarrays.

When biologists at Johns Hopkins first came to our department for help with their microarray data, Scott put them in touch with me because I had experience with (what was then) large datasets (digitized music signals are represented by 44,100 points per second). The more I learned about the scientific problems and the more data I explored, the more motivated I became. The potential for statisticians having an impact in this nascent field was clear and my department was encouraging me to take the plunge. This institutional encouragement and support was crucial, as successfully working in this field made it harder to publish in the mainstream statistical journals, an accomplishment that had traditionally been heavily weighted in the promotion process. The message was clear: having an immediate impact on specific scientific fields would be rewarded as much as mathematically rigorous methods with general applicability.

As with my thesis applications, it was clear that to solve some of the challenges posed by microarray data I would have to learn all about the technology. For this I organized a sabbatical with Terry Speed's group in Melbourne, where they helped me accomplish this goal. During this visit I reaffirmed my preference for attacking applied problems with simple statistical methods, as opposed to overcomplicated ones or developing new techniques. Learning that discovering clever ways of putting the existing statistical toolbox to work was good enough for an accomplished statistician like Terry gave me the necessary confidence to continue working this way. More than a decade later this continues to be my approach to applied statistics. This approach has been instrumental for some of my current collaborative work. In particular, it led to important new biological discoveries made together with Andy Feinberg's lab (Irizarry, 2009).


During my sabbatical we developed preliminary solutions that improved precision and aided in the removal of systematic biases from microarray data (Irizarry, 2003). I was aware that hundreds, if not thousands, of other scientists were facing the same problematic data and were searching for solutions. Therefore I was also thinking hard about ways in which I could share whatever solutions I developed with others. During this time I received an email from Robert Gentleman asking if I was interested in joining a new software project for the delivery of statistical methods for genomics data. This new collaboration eventually became the Bioconductor project (http://www.bioconductor.org), which to this day continues to grow its user and developer base (Gentleman et al., 2004). Bioconductor was the perfect vehicle for having the impact that my department had encouraged me to seek. With Ben Bolstad and others we wrote an R package that has been downloaded tens of thousands of times (Gautier et al., 2004). Without the availability of software, the statistical method would not have received nearly as much attention. This lesson served me well throughout my career, as developing software packages has greatly helped disseminate my statistical ideas. The fact that my department and school rewarded software publications provided important support.

The impact statisticians have had in genomics is just one example of our field's accomplishments in the 21st century. In academia, the number of statisticians becoming leaders in fields such as environmental sciences, human genetics, genomics, and social sciences continues to grow. Outside of academia, sabermetrics has become a standard approach in several sports (not just baseball) and inspired the Hollywood movie Moneyball. A PhD statistician led the team that won the Netflix million-dollar prize (http://www.netflixprize.com/). Nate Silver (http://mashable.com/2012/11/07/nate-silver-wins/) proved the pundits wrong by once again using statistical models to predict election results almost perfectly. R has become a widely used programming language. It is no surprise that statistics majors at Harvard have more than quadrupled since 2000 (http://nesterko.com/visuals/statconcpred2012-with-dm/) and that statistics MOOCs are among the most popular (http://edudemic.com/2012/12/the-11-most-popular-open-online-courses/).

The unprecedented advance in digital technology during the second half of the 20th century has produced a measurement revolution that is transforming science. Scientific fields that have traditionally relied upon simple data analysis techniques have been turned on their heads by these technologies. Furthermore, advances such as these have brought about a shift from hypothesis-driven to discovery-driven research. However, interpreting information extracted from these massive and complex datasets requires sophisticated statistical skills, as one can easily be fooled by patterns that arise by chance. This has greatly elevated the importance of our discipline in biomedical research.


16.4 The bright future

I think that the data revolution is just getting started. Datasets are currently being, or have already been, collected that contain, hidden in their complexity, important truths waiting to be discovered. These discoveries will increase the scientific understanding of our world. Statisticians should be excited and ready to play an important role in the new scientific renaissance driven by the measurement revolution.

References

Crone, N.E., Hao, L., Hart, J., Boatman, D., Lesser, R.P., Irizarry, R., and Gordon, B. (2001). Electrocorticographic gamma activity during word production in spoken and sign language. Neurology, 57:2045–2053.

DiPietro, J.A., Irizarry, R.A., Hawkins, M., Costigan, K.A., and Pressman, E.K. (2001). Cross-correlation of fetal cardiac and somatic activity as an indicator of antenatal neural development. American Journal of Obstetrics and Gynecology, 185:1421–1428.

Gautier, L., Cope, L., Bolstad, B.M., and Irizarry, R.A. (2004). affy — analysis of Affymetrix GeneChip data at the probe level. Bioinformatics, 20:307–315.

Gentleman, R.C., Carey, V.J., Bates, D.M., Bolstad, B., Dettling, M., Dudoit, S., Ellis, B., Gautier, L., Ge, Y., Gentry, J. et al. (2004). Bioconductor: Open software development for computational biology and bioinformatics. Genome Biology, 5:R80.

Irizarry, R.A. (2001). Local harmonic estimation in musical sound signals. Journal of the American Statistical Association, 96:357–367.

Irizarry, R.A., Hobbs, B., Collin, F., Beazer-Barclay, Y.D., Antonellis, K.J., Scherf, U., and Speed, T.P. (2003). Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics, 4:249–264.

Irizarry, R.A., Ladd-Acosta, C., Wen, B., Wu, Z., Montano, C., Onyango, P., Cui, H., Gabo, K., Rongione, M., Webster, M. et al. (2009). The human colon cancer methylome shows similar hypo- and hypermethylation at conserved tissue-specific CpG island shores. Nature Genetics, 41:178–186.

Irizarry, R.A., Tankersley, C., Frank, R., and Flanders, S. (2001). Assessing homeostasis through circadian patterns. Biometrics, 57:1228–1237.

van't Veer, L.J., Dai, H., Van De Vijver, M.J., He, Y.D., Hart, A.A.M., Mao, M., Peterse, H.L., van der Kooy, K., Marton, M.J., Witteveen, A.T. et al. (2002). Gene expression profiling predicts clinical outcome of breast cancer. Nature, 415:530–536.


17
The road travelled: From statistician to statistical scientist

Nilanjan Chatterjee
Biostatistics Branch, Division of Cancer Epidemiology and Genetics
National Cancer Institute
National Institutes of Health
Department of Health and Human Services

In this chapter, I will attempt to describe how a series of problems in statistical genetics, starting from a project about estimation of risk for BRCA1/2 mutation carriers, have driven a major part of my research at the National Cancer Institute over the last fourteen years. I try to share some of the statistical and scientific perspectives in this field that I have developed over the years, during which I myself have transformed from a theoretical statistician to a statistical scientist. I hope my experience will draw the attention of other statisticians, especially young researchers who have perceived strength in statistical theory but are keen on using their knowledge to advance science, to the tremendous opportunity that lies ahead.

17.1 Introduction

I have been wondering for a while now what I would like to share with the readers about my experience as a statistician. Suddenly, a news event that has captivated the attention of the world in the last few days gave me the impetus to start on this piece. Angelina Jolie, the beautiful and famous Hollywood actress, also well known for her humanitarian work across the globe, has made the bold decision to opt for bilateral mastectomy after knowing that she carries a mutation in the gene called BRCA1 which, according to her doctors, gives her a risk of 87% for developing breast cancer and 50% for developing ovarian cancer (Jolie, May 14, 2013). The print and online media, blogosphere and social media sites are all buzzing with discussion about her courage in not only making the drastic decision to take control of her own health, but also helping to increase awareness about the issues by sharing the deeply personal story with the general public. At the same time, a debate is also raging about whether she got the right estimates of her risks and whether she could have made alternative, less drastic choices to minimize and manage her long-term risks of these two cancers. And of course, all of this is embedded in a much broader debate in the community about the appropriate use of comprehensive genome sequencing information for personalized medicine (treatment and prevention) in the future.

17.2 Kin-cohort study: My gateway to genetics

This debate about Angelina Jolie's decision is taking me back down memory lane to May of 1999, when I joined the Division of Cancer Epidemiology and Genetics of the National Cancer Institute as a post-doctoral fellow after finishing my PhD in Statistics at the University of Washington, Seattle. The very first project my mentor, Sholom Wacholder, introduced me to involved estimation of the risk of certain cancers, including those of the breast and ovary, associated with mutations in the genes BRCA1 and BRCA2, from the Washington Ashkenazi Study (WAS) (Struewing et al., 1997). My mentor and his colleagues had previously developed the novel "kin-cohort" approach that allowed estimation of the age-specific cumulative risk of a disease associated with a genetic mutation based on the disease history of the set of relatives of genotyped study participants (Wacholder et al., 1998). This approach, when applied to the WAS study, estimated the lifetime risk or penetrance of breast cancer to be between 50–60%, substantially lower than the estimates of penetrance around 90–100% that had previously been obtained from analysis of highly disease-enriched families. It was thought that WAS, which was volunteer-based and not as prone to ascertainment bias as family studies, provided a more unbiased estimate of risk for BRCA1/2 mutation carriers in the general population. Other studies that have employed "un-selected" designs have estimated the penetrance to be even lower.

The approach my mentor and colleagues had previously developed was very simple and elegant. It relied on the observation that, since approximately half of the first-degree relatives of BRCA1/2 mutation carriers are expected to be carriers themselves due to Mendelian laws of transmission, the risk of the disease in this group of relatives should be approximately a 50:50 mixture of the risks of the disease associated with carriers and non-carriers. Further, for a rare mutation, since very few of the first-degree relatives of the non-carriers are expected to be carriers themselves, the risk of the disease in this group of relatives should be approximately the same as that for non-carriers themselves. Thus they employed a simple method-of-moments approach to estimate the age-specific cumulative risks associated with carriers and non-carriers using the Kaplan–Meier estimators of risk for the first-degree relatives of carriers and those for non-carriers.
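The moment inversion behind this idea takes only a few lines. The toy sketch below is my own illustration with made-up numbers, not the WAS data: it mixes hypothetical carrier and non-carrier cumulative risk curves the way the kin-cohort design observes them, and then recovers the carrier curve. In a real analysis the two observed curves would be Kaplan–Meier estimates computed from the relatives' disease histories, and the mixing proportion would be derived from Mendelian transmission and the allele frequency.

```python
import numpy as np

# Ages at which cumulative risk is evaluated.
ages = np.array([40, 50, 60, 70, 80])

# Hypothetical "true" age-specific cumulative risks (illustrative numbers only).
risk_carrier = np.array([0.10, 0.25, 0.40, 0.52, 0.60])
risk_noncarrier = np.array([0.01, 0.03, 0.06, 0.09, 0.12])

# What the kin-cohort design observes, in expectation: first-degree relatives
# of carrier probands are roughly a 50:50 mixture of carriers and non-carriers,
# while relatives of non-carrier probands are (for a rare mutation) essentially
# all non-carriers.
risk_rel_of_carriers = 0.5 * risk_carrier + 0.5 * risk_noncarrier
risk_rel_of_noncarriers = risk_noncarrier

# Method-of-moments inversion; in practice these two observed curves would be
# Kaplan-Meier estimates from the relatives' disease histories.
est_noncarrier = risk_rel_of_noncarriers
est_carrier = 2.0 * risk_rel_of_carriers - risk_rel_of_noncarriers

print(np.allclose(est_carrier, risk_carrier))  # True: the mixture is inverted exactly
```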


I attempted to formulate the problem in terms of a composite likelihood framework (Chatterjee and Wacholder, 2001) so that the resulting estimator has desirable statistical properties, such as monotonicity of the age-specific cumulative risk curve, and is robust to strong assumptions about residual familial correlation of disease among family members. The likelihood-based framework was quite attractive due to its flexibility for performing various additional analyses, and I was happy that I was able to make a quick methodologic contribution, even being fairly novice to the field. However, the actual application of the method to WAS hardly changed the estimate of penetrance for BRCA1/2 mutations compared to the method-of-moments estimates previously available.

In retrospect, I learned several lessons from my first post-doctoral project. First, it is often hard to beat a simple but sensible statistical method. Although this may be obvious to many seasoned applied statisticians, this first-hand experience was an important lesson for me, fresh out of graduate school, where my PhD thesis involved understanding of semi-parametric efficient estimation methodology, the purpose of which is to get the last drop of information from the data with minimal assumptions about "nuisance" parameters. Second, although the substantive contribution of my first project was modest, it was an invaluable exercise for me as it opened my gateway to the whole new area of statistical genetics. To get a solid grasp of the problem without having any knowledge of genetics a priori, I had to teach myself concepts of population as well as molecular genetics. Self-teaching and my related struggles were an invaluable experience that helps me to this day to think about each problem in my own way.

17.3 Gene-environment interaction: Bridging genetics and theory of case-control studies

As I was wrapping up my work on kin-cohort studies, one day Sholom asked me for some help to analyze data from a case-control study of ovarian cancer to assess interaction between BRCA1/2 mutations and certain reproductive factors, such as oral contraceptive use, which are known to reduce the risk of the disease in the general population (Modan et al., 2001). Because BRCA1/2 mutations are very rare in the general population, standard logistic regression analysis of interaction would have been very imprecise. Instead, the investigators in this study were pursuing an alternative method that uses a log-linear modeling approach that can incorporate the reasonable assumption that reproductive factors are distributed independently of genetic mutation status in the general population. Earlier work has shown that incorporation of the gene-environment independence assumption can greatly enhance the efficiency of interaction analysis in case-control studies (Piegorsch et al., 1994; Umbach and Weinberg, 1997).


While I was attempting to analyze the study using the log-linear model framework, I realized it is a bit of a cumbersome approach that requires creating massive contingency tables by categorizing all of the variables under study, carefully tracking different sets of parameters (e.g., parameters related to disease odds-ratios and exposure frequency distributions), and then constraining specific parameters to zero for incorporation of the gene-environment independence assumption. I quickly realized that all of these details can be greatly simplified by some of the techniques I had learned from my advisors Norman Breslow and Jon Wellner during my PhD thesis regarding semi-parametric analysis of case-control and other types of studies that use complex outcome-dependent sampling designs (Chatterjee et al., 2003). In particular, I was able to derive a profile-likelihood technique that simplifies fitting of logistic regression models to case-control data under the gene-environment independence assumption (Chatterjee and Carroll, 2005). Early in the development, I told Ray Carroll, who has been a regular visitor at NCI for a long time, about some of the results I had derived, and he got everyone excited because of his own interest and earlier research in this area. Since then Ray and I have been partners in crime in many papers related to inference on gene-environment interactions from case-control studies.
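The efficiency gain bought by the independence assumption is easiest to see in the simplest binary setting, via the case-only idea of Piegorsch et al. (1994). The simulation below is my own toy illustration, not an analysis from the chapter: with a rare gene, the usual case-control interaction estimate rests on sparse genotype-by-exposure cells in both cases and controls, whereas under gene-environment independence (and a rare disease) the gene-exposure association among the cases alone estimates the same multiplicative interaction parameter.

```python
import numpy as np

rng = np.random.default_rng(1)

# Population with a rare genetic factor G and a common exposure E,
# distributed independently of each other (the key assumption).
n_pop = 2_000_000
G = rng.binomial(1, 0.01, n_pop)
E = rng.binomial(1, 0.30, n_pop)

# Rare disease with a multiplicative interaction on the odds scale.
beta0, beta_g, beta_e, beta_ge = -6.0, np.log(2.0), np.log(1.5), np.log(0.5)
p = 1.0 / (1.0 + np.exp(-(beta0 + beta_g * G + beta_e * E + beta_ge * G * E)))
D = rng.binomial(1, p)

# Case-control sample: all cases plus an equal number of controls.
cases = np.flatnonzero(D == 1)
controls = rng.choice(np.flatnonzero(D == 0), size=len(cases), replace=False)

def log_or(g, e):
    """Log odds ratio between two binary variables from a 2x2 table (+0.5 correction)."""
    a = np.sum((g == 1) & (e == 1)) + 0.5
    b = np.sum((g == 1) & (e == 0)) + 0.5
    c = np.sum((g == 0) & (e == 1)) + 0.5
    d = np.sum((g == 0) & (e == 0)) + 0.5
    return np.log(a * d / (b * c))

# Standard case-control interaction estimate: difference of G-E log odds
# ratios between cases and controls.
standard = log_or(G[cases], E[cases]) - log_or(G[controls], E[controls])

# Case-only estimate: under G-E independence and a rare disease, the G-E
# association among cases alone estimates the interaction log odds ratio.
case_only = log_or(G[cases], E[cases])

print(f"true interaction log OR:  {beta_ge:.2f}")
print(f"standard case-control:    {standard:.2f}")
print(f"case-only (independence): {case_only:.2f}")
```

Because the case-only estimator dispenses with the noisy control-arm odds ratio, its variance is smaller; the profile-likelihood approach described above extends this kind of gain to full logistic regression models with covariates.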


This project also taught me a number of important lessons. First, there is tremendous value in understanding the theoretical underpinning of standard methods that we routinely use in practice. Without the understanding of the fundamentals of semi-parametric inference for analysis of case-control data that I developed during my graduate studies, I would never have made the connection of this problem to profile likelihood, which is essentially the backbone for many standard methods, such as Cox's partial likelihood analysis of lifetime data. The approach not only provided a simplified framework for exploiting the gene-environment independence assumption for case-control studies, but also led to a series of natural extensions of practical importance, so that the method is less sensitive to violation of the critical gene-environment independence assumption. My second lesson was that it is important not to lose the applied perspective even when one is deeply involved in the development of theory. Because we paid close attention to the practical limitations of the original methods and to cutting-edge developments in genetic epidemiology, my collaborators, trainees and I (Ray Carroll, Yi-Hau Chen, Bhramar Mukherjee, Samsiddhi Bhattacharjee, to name a few) were able to propose robust extensions of the methodology using conditional logistic regression, shrinkage estimation techniques and genomic control methods (Chatterjee et al., 2005; Mukherjee and Chatterjee, 2008; Bhattacharjee et al., 2010).

17.4 Genome-wide association studies (GWAS): Introduction to big science

BRCA1/2 mutations, which pose high risks for breast and ovarian cancer but are rare in the general population, were originally discovered in the early 1990s through linkage studies that involve analysis of the co-segregation of genetic mutations and disease within highly affected families (Hall et al., 1990). From the beginning of the 21st century, after the human genome project was completed and large-scale genotyping technologies evolved, the genetic community started focusing on genome-wide association studies (GWAS). The purpose of these studies was to identify genetic variants which may pose a more modest risk of diseases, like breast cancer, but are more common in the general population. Early in this effort, the leadership of our Division decided to launch two such studies, one for breast cancer and one for prostate cancer, under the rubric of the Cancer Genetic Markers of Susceptibility (CGEMS) studies (Hunter et al., 2007; Yeager et al., 2007).

I participated in these studies as part of a four-member team of statisticians who provided oversight of the quantitative issues in the design and analysis aspects of these studies. For me, this was my first exposure to large "team science," where progress could only be made through collaborations of a team of researchers with diverse backgrounds, such as genomics, epidemiology, bioinformatics, and statistics. Getting into the nitty-gritty of the studies gave me an appreciation of the complexities of large-scale genomic studies. I realized that while we statisticians are prone to focus on developing an "even more optimal" method of analysis, some of the most fundamental and interesting quantitative issues in these types of studies lie elsewhere, in particular in the areas of study design, quality control and characterization following discovery (see the next section for more on the last topic).

I started thinking seriously about study design when I was helping one of my epidemiologic collaborators put together a proposal for conducting a genome-wide association study for lung cancer. As a principled statistician, I felt some responsibility to show that the proposed study was likely to make new discoveries beyond the three GWAS of lung cancer that had just been published in high-profile journals such as Nature and Nature Genetics. I realized that standard power calculations, where investigators typically show that the study has 80–90% power to detect certain effect sizes, are not satisfactory for ever-growing GWA studies. If I wanted to do a more intelligent power calculation, I first needed to make an assessment of what might be the underlying genetic architecture of the trait, in particular how many genetic variants might be associated with the trait and what their effect-sizes are.


I made a very simple observation that the discoveries made in an existing study can be thought of as a random sample from the underlying "population" of susceptibility markers, where the probability of discovery of any given marker is proportional to the power of the study at the effect-size associated with that marker. My familiarity with sample-survey theory, which I developed during my PhD thesis on two-phase study design, again came in very handy here. I worked with a post-doctoral fellow, JuHyun Park, to develop an "inverse-power-weighted" method, similar to the widely used "inverse-probability-weighted" (IPW) methods for analysis of survey data, for inferring the number of underlying susceptibility markers for a trait and their effect-size distribution using published information on known discoveries and the study design of the underlying GWA studies (Park et al., 2010). We inferred the genetic architecture of several complex traits using this method and made projections about the expected number of discoveries in GWAS of these traits. We have been very pleased to see that our projections were quite accurate as results from larger and larger GWA studies have come out for these traits since the publication of our report (Allen et al., 2010; Anderson et al., 2011; Eeles et al., 2013; Michailidou et al., 2013).
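The weighting idea can be sketched in a few lines. The code below is a bare-bones illustration under assumptions I have added for concreteness (a single discovery study, known standardized effect sizes, a two-sided single-SNP z-test at genome-wide significance); the actual method of Park et al. (2010) is considerably more careful, in particular about winner's curse in the estimated effect sizes and about combining studies with different designs.

```python
import numpy as np
from scipy.stats import norm

def detection_power(std_effect, n, alpha=5e-8):
    """Power of a two-sided single-SNP z-test at significance level alpha,
    for a standardized per-allele effect and a discovery sample size n."""
    z_crit = norm.isf(alpha / 2.0)
    ncp = np.sqrt(n) * np.asarray(std_effect)
    return norm.sf(z_crit - ncp) + norm.cdf(-z_crit - ncp)

def inverse_power_weighted_count(std_effects_of_hits, n_discovery, alpha=5e-8):
    """Horvitz-Thompson style estimate of the total number of susceptibility
    loci with effects like those already discovered: each hit stands in for
    1 / (its detection probability) loci in the underlying 'population'."""
    powers = detection_power(std_effects_of_hits, n_discovery, alpha)
    return float(np.sum(1.0 / powers))

# Hypothetical example: five hits from a GWAS of 20,000 subjects.
hits = [0.045, 0.040, 0.038, 0.036, 0.035]   # standardized effect sizes (made up)
print(round(inverse_power_weighted_count(hits, n_discovery=20_000), 1))
```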


Realizing how optimal study design is fundamentally related to the underlying genetic architecture of traits, both JuHyun and I continued to delve into these related issues. Again using known discoveries from published studies and information on the design of existing studies, we showed that there is very modest or no evidence of an inverse relationship between effect-size and allele frequency for genetic markers, a hypothesis in population genetics postulated from a selection point of view and one that has often been used in the past by scientists to motivate studies of less common and rare variants using sequencing technologies (Park et al., 2011). From the design point of view, we conjectured that the lack of a strong relationship between allele frequency and effect-size implies that future studies of less common and rare variants will require even larger sample sizes than current GWAS to make comparable numbers of discoveries for underlying susceptibility loci.

Understanding its implications for discoveries made us question the implication of genetic architecture for risk-prediction, another hotly debated topic. Interestingly, while the modern statistical literature is very rich regarding optimal algorithms for building models, very little attention is given to more fundamental design questions, such as how our ability to predict a trait is inherently limited by the sample size of the training datasets and the genetic architecture of the trait, or more generally the etiologic architecture that may involve both genetic and non-genetic factors. This motivated us to develop a mathematical approximation for the relationship between the expected predictive performance of models, the sample size of training datasets and the genetic architecture of traits. Based on these formulations, we projected that the highly polygenic nature of complex traits implies that future GWAS will require extremely large sample sizes, possibly of a higher order of magnitude than even some of the largest GWAS to date, for substantial improvement of risk-prediction based on genetic information (Chatterjee et al., 2013).

Although the study of genetic architecture and its implications for study designs is now a significant part of my research portfolio, it was not by design by any means. I just stumbled upon the problem when I was attempting to do basic power calculations for a collaborator. Looking back, it was a risky thing to undertake, as I was not sure where my effort was going to lead, other than that maybe I could advise my collaborators a little more intelligently about study designs. It was more tempting to focus, like many of my peers have done, sometimes very successfully, on so-called "hot" problems such as developing an optimal association test. Although I have put some effort in those areas as well, today I am really glad that instead of chasing more obvious problems, I gave myself the freedom to venture into unknown territories. The experimentation has certainly helped me, and hopefully the larger scientific community, to obtain some fresh insights into study designs, statistical power calculations and risk-prediction in the context of modern high-throughput studies.

17.5 The post-GWAS era: What does it all mean?

It is quite amazing that even more than two decades after the BRCA1/2 mutations were discovered, there is so much ambiguity about the true risks associated with these genes for various cancers. In the literature, available estimates of the lifetime risk of breast cancer, for example, vary from 20–90%. As noted above, while estimates available from highly enriched cancer families tend to reside at the higher end of that range, their counterparts from population-based studies tend to be more at the lower end. Risk for an individual carrier would also depend on other information, such as the specific mutation type, cancer history among family members and information on other risk factors. The problem of estimation of risk associated with rare, highly penetrant mutations poses many interesting statistical challenges and has generated a large volume of literature.

Discoveries from genome-wide association studies are now fueling the debate about how the discovery of low-penetrant common variants can be useful for public health. Some researchers argue that common variants, irrespective of how modest their effects are, can individually or collectively highlight interesting biologic pathways that are involved in the pathogenesis of a disease and hence potentially be useful for the development of drug targets. Although this would be a highly desirable outcome, skepticism exists given that discoveries of even major genes like BRCA1/2 have seldom led to successful development of drug targets. The utility of common variants for genetic risk prediction is also now a matter of great debate. While a number of early studies painted a mostly negative picture, the large numbers of discoveries from the most recent very large GWAS suggest that there is indeed potential for common variants to improve risk-prediction.

I am an avid follower of this debate. While the focus of the genetics community is quickly shifting towards what additional discoveries are possible using future whole-exome or genome sequencing studies, I keep wondering about how the knowledge from GWAS could be further refined and ultimately applied to improve public health.


In general, I see the pattern that whenever a new technology emerges, there is tremendous interest in making new "discoveries," but the effort is not proportionate when it comes to following up these discoveries for better "characterization" of risk and/or the underlying causal mechanism. Interestingly, while in the discovery effort statisticians face stiff competition from researchers in other quantitative disciplines, like geneticists, bioinformaticians, computer scientists and physicists, statisticians have the potential to develop a more exclusive niche in the "characterization" steps, where the underlying inferential issues are often of a much more complex nature than simple hypothesis testing.

17.6 Conclusion

In the last fourteen years, the biggest change I observe within myself is how I think about a problem. When I started working on the refinement of kin-cohort methods, I focused on developing novel methods but was not very aware of all the underlying, very complex clinical and epidemiologic subject-matter issues. Now that I am struggling with the question of what the discoveries from current GWAS and future sequencing studies would mean for personalized medicine and public health, I feel I have a better appreciation of those pertinent scientific issues and the related debate. For this, I owe much to the highly stimulating scientific environment of our Division, created and fostered over decades by our recently retired director Dr. Joseph Fraumeni Jr. and a number of other leaders. The countless conversations and debates I had with my statistician, epidemiologist and geneticist colleagues in the meeting rooms, corridors and cafeteria of DCEG about cutting-edge issues in cancer genetics, epidemiology and prevention had a major effect on me. At the same time, my training in theory and methods always guides my thinking about these applied problems in a statistically rigorous way. I consider myself fortunate to have inherited my academic "genes" through training in statistics and biostatistics from the Indian Statistical Institute and the University of Washington, and then to have been exposed to the great "environment" of DCEG for launching my career as a statistical scientist.

Congratulations to COPSS for sustaining and supporting such a great profession as ours for 50 years! It is an exciting time to be a statistician in the current era of data science. There are tremendous opportunities for our profession, which also come with tremendous responsibility. While I was finishing up this chapter, the US Supreme Court ruled that genes are not "patentable," implying genetic tests will become more openly available to consumers in the future. The debate about whether Angelina Jolie got the right information about her risk from BRCA1/2 testing is just a reminder of the challenge that lies ahead for all of us: to use genetic and other types of biomedical data to create objective "knowledge" that will benefit, and not misguide or harm, medical researchers, clinicians and most importantly the public.


References

Allen, H.L. et al. (2010). Hundreds of variants clustered in genomic loci and biological pathways affect human height. Nature, 467:832–838.

Anderson, C.A., Boucher, G., Lees, C.W., Franke, A., D'Amato, M., Taylor, K.D., Lee, J.C., Goyette, P., Imielinski, M., Latiano, A. et al. (2011). Meta-analysis identifies 29 additional ulcerative colitis risk loci, increasing the number of confirmed associations to 47. Nature Genetics, 43:246–252.

Bhattacharjee, S., Wang, Z., Ciampa, J., Kraft, P., Chanock, S., Yu, K., and Chatterjee, N. (2010). Using principal components of genetic variation for robust and powerful detection of gene-gene interactions in case-control and case-only studies. The American Journal of Human Genetics, 86:331–342.

Chatterjee, N. and Carroll, R.J. (2005). Semiparametric maximum likelihood estimation exploiting gene-environment independence in case-control studies. Biometrika, 92:399–418.

Chatterjee, N., Chen, Y.-H., and Breslow, N.E. (2003). A pseudoscore estimator for regression problems with two-phase sampling. Journal of the American Statistical Association, 98:158–168.

Chatterjee, N., Kalaylioglu, Z., and Carroll, R.J. (2005). Exploiting gene-environment independence in family-based case-control studies: Increased power for detecting associations, interactions and joint effects. Genetic Epidemiology, 28:138–156.

Chatterjee, N. and Wacholder, S. (2001). A marginal likelihood approach for estimating penetrance from kin-cohort designs. Biometrics, 57:245–252.

Chatterjee, N., Wheeler, B., Sampson, J., Hartge, P., Chanock, S.J., and Park, J.-H. (2013). Projecting the performance of risk prediction based on polygenic analyses of genome-wide association studies. Nature Genetics, 45:400–405.

Eeles, R.A., Al Olama, A.A., Benlloch, S., Saunders, E.J., Leongamornlert, D.A., Tymrakiewicz, M., Ghoussaini, M., Luccarini, C., Dennis, J., Jugurnauth-Little, S. et al. (2013). Identification of 23 new prostate cancer susceptibility loci using the iCOGS custom genotyping array. Nature Genetics, 45:385–391.

Hall, J.M., Lee, M.K., Newman, B., Morrow, J.E., Anderson, L.A., Huey, B., King, M.-C. et al. (1990). Linkage of early-onset familial breast cancer to chromosome 17q21. Science, 250:1684–1689.

Hunter, D.J., Kraft, P., Jacobs, K.B., Cox, D.G., Yeager, M., Hankinson, S.E., Wacholder, S., Wang, Z., Welch, R., Hutchinson, A. et al. (2007). A genome-wide association study identifies alleles in FGFR2 associated with risk of sporadic postmenopausal breast cancer. Nature Genetics, 39:870–874.

Jolie, A. (May 14, 2013). My Medical Choice. The New York Times.

Michailidou, K., Hall, P., Gonzalez-Neira, A., Ghoussaini, M., Dennis, J., Milne, R.L., Schmidt, M.K., Chang-Claude, J., Bojesen, S.E., Bolla, M.K. et al. (2013). Large-scale genotyping identifies 41 new loci associated with breast cancer risk. Nature Genetics, 45:353–361.

Modan, B., Hartge, P., Hirsh-Yechezkel, G., Chetrit, A., Lubin, F., Beller, U., Ben-Baruch, G., Fishman, A., Menczer, J., Struewing, J.P. et al. (2001). Parity, oral contraceptives, and the risk of ovarian cancer among carriers and noncarriers of a BRCA1 or BRCA2 mutation. New England Journal of Medicine, 345:235–240.

Mukherjee, B. and Chatterjee, N. (2008). Exploiting gene-environment independence for analysis of case-control studies: An empirical Bayes-type shrinkage estimator to trade-off between bias and efficiency. Biometrics, 64:685–694.

Park, J.-H., Gail, M.H., Weinberg, C.R., Carroll, R.J., Chung, C.C., Wang, Z., Chanock, S.J., Fraumeni, J.F., and Chatterjee, N. (2011). Distribution of allele frequencies and effect sizes and their interrelationships for common genetic susceptibility variants. Proceedings of the National Academy of Sciences, 108:18026–18031.

Park, J.-H., Wacholder, S., Gail, M.H., Peters, U., Jacobs, K.B., Chanock, S.J., and Chatterjee, N. (2010). Estimation of effect size distribution from genome-wide association studies and implications for future discoveries. Nature Genetics, 42:570–575.

Piegorsch, W.W., Weinberg, C.R., and Taylor, J.A. (1994). Non-hierarchical logistic models and case-only designs for assessing susceptibility in population-based case-control studies. Statistics in Medicine, 13:153–162.

Struewing, J.P., Hartge, P., Wacholder, S., Baker, S.M., Berlin, M., McAdams, M., Timmerman, M.M., Brody, L.C., and Tucker, M.A. (1997). The risk of cancer associated with specific mutations of BRCA1 and BRCA2 among Ashkenazi Jews. New England Journal of Medicine, 336:1401–1408.

Umbach, D.M. and Weinberg, C.R. (1997). Designing and analysing case-control studies to exploit independence of genotype and exposure. Statistics in Medicine, 16:1731–1743.

Wacholder, S., Hartge, P., Struewing, J.P., Pee, D., McAdams, M., Brody, L., and Tucker, M. (1998). The kin-cohort study for estimating penetrance. American Journal of Epidemiology, 148:623–630.

Yeager, M., Orr, N., Hayes, R.B., Jacobs, K.B., Kraft, P., Wacholder, S., Minichiello, M.J., Fearnhead, P., Yu, K., Chatterjee, N. et al. (2007). Genome-wide association study of prostate cancer identifies a second risk locus at 8q24. Nature Genetics, 39:645–649.


18
A journey into statistical genetics and genomics

Xihong Lin
Department of Biostatistics
Harvard School of Public Health, Boston, MA

This chapter provides personal reflections and lessons I learned through my journey into the field of statistical genetics and genomics in the last few years. I will discuss the importance of being both a statistician and a scientist; challenges and opportunities in analyzing massive genetic and genomic data; and training the next generation of statistical genetic and genomic scientists in the 'omics era.

18.1 The 'omics era

The human genome project, in conjunction with the rapid advance of high-throughput technology, has transformed the landscape of health science research in the last ten years. Scientists have been assimilating the implications of the genetic revolution, characterizing the activity of genes, messenger RNAs, and proteins, studying the interplay of genes and the environment in causing human diseases, and developing strategies for personalized medicine. Technological platforms have advanced to a stage where many biological entities, e.g., genes, transcripts, and proteins, can be measured on the whole-genome scale, yielding massive high-throughput 'omics data, such as genetic, genomic, epigenetic, proteomic, and metabolomic data. The 'omics era provides an unprecedented promise of understanding common complex diseases, developing strategies for disease risk assessment, early detection, prevention and intervention, and personalized therapies.

The volume of genetic and genomic data has exploded rapidly in the last few years. Genome-wide association studies (GWAS) use arrays to genotype 500,000–5,000,000 common Single Nucleotide Polymorphisms (SNPs) across the genome. Over a thousand GWASs have been conducted in the last few years. They have identified hundreds of common genetic variants that are associated with complex traits and diseases (http://www.genome.gov/gwastudies/). The emerging next generation sequencing technology offers an exciting new opportunity for sequencing the whole genome, obtaining information about both common and rare variants and structural variation. Next generation sequencing data allow us to explore the roles of rare genetic variants and mutations in human diseases. Candidate gene sequencing, whole exome sequencing and whole genome sequencing studies are being conducted. High-throughput RNA and epigenetic sequencing data are also rapidly becoming available to study gene regulation and functionality, and the mechanisms of biological systems. A large number of public genomic databases, such as the HapMap Project (http://hapmap.ncbi.nlm.nih.gov/) and the 1000 Genomes Project (www.1000genomes.org), are freely available. The NIH database of Genotypes and Phenotypes (dbGaP) archives and distributes data from many GWAS and sequencing studies funded by NIH freely to the general research community for enhancing new discoveries.

The emerging sequencing technology presents many new opportunities. Whole genome sequencing measures the complete DNA sequence of the genome of a subject at three billion base-pairs. Although the current cost of whole genome sequencing prohibits conducting large-scale studies, with the rapid advance of biotechnology the "1000 dollar genome" era will come in the near future. This will usher in a new era of predictive and personalized medicine, during which full genome sequencing for an individual or patient costs only $1000 or less. An individual subject's genome map will help patients and physicians identify personalized, effective treatment decisions and intervention strategies.

While the 'omics era presents many exciting research opportunities, the explosion of massive information about the human genome presents extraordinary challenges in data processing, integration, analysis and result interpretation. The volume of whole genome sequencing data is substantially larger than that of GWAS data, and is on the order of tens or hundreds of terabytes (TB). In recent years, the limited quantitative methods suitable for analyzing these data have emerged as a bottleneck for effectively translating rich information into meaningful knowledge. There is a pressing need to develop statistical methods for these data to bridge the technology and information transfer gap, in order to accelerate innovations in disease prevention and treatment. As noted by John McPherson, from the Ontario Institute for Cancer Research,

"There is a growing gap between the generation of massively parallel sequencing output and the ability to process and analyze the resulting data. Bridging this gap is essential, or the coveted $1000 genome will come with a $20,000 analysis price tag." (McPherson, 2009)

This is an exciting time for statisticians.


I discuss in this chapter how I became interested in statistical genetics and genomics a few years ago, and the lessons I learned while making my journey into this field. I will discuss a few open and challenging problems to demonstrate that statistical genetics and genomics is a stimulating field with many opportunities for statisticians to make methodological and scientific contributions. I will also discuss training the next generation of quantitative scientists in the 'omics era.

18.2 My move into statistical genetics and genomics

Moving into statistical genetics and genomics was a significant turn in my career. My dissertation work was on GLMMs, i.e., generalized linear mixed models (Breslow and Clayton, 1993). In the first twelve years out of graduate school, I had been primarily working on developing statistical methods for the analysis of correlated data, such as mixed models, measurement errors, and nonparametric and semiparametric regression for longitudinal data. When I obtained my PhD degree, I had quite limited knowledge about nonparametric and semiparametric regression using kernels and splines. Learning a new field is challenging. One is more likely to be willing to invest time and energy to learn a new field when stimulated by problems in an open new area and by identifying niches. I was fascinated by the opportunities for developing nonparametric and semiparametric regression methods for longitudinal data, as little work had been done in this area and there were plenty of open problems. Such an experience can be rewarding if timing and environment are right and good collaborators are found. One is more likely to make unique contributions to a field when it is still at an early stage of development. This experience also speaks well to the lessons I learned in my journey into statistical genetics and genomics.

After I moved to the Harvard School of Public Health in 2005, I was interested in exploring new areas of research. My collaborative projects turned out to be mainly in genetic epidemiological studies and environmental genetic studies, a field I had little knowledge about. In the next several years, I gradually became engaged in several ongoing genome-wide association studies, DNA methylation studies, and genes and environment studies. I was fascinated by the challenges in the analysis of large genetic and genomic data, and by the rich methodological opportunities for addressing many open statistical problems that are likely to facilitate new genetic discovery and advance science. At the same time, I realized that to make contributions in this field, one has to understand genetics and biology well enough to identify interesting problems and develop methods that are practically relevant and useful. In my sabbatical year in 2008, I decided to audit a molecular biology course, which was very helpful for me to build a foundation in genetics and understand the genetic jargon in my ongoing collaborative projects.


This experience prepared me to get started working on methodological research in statistical genetics and genomics. Looking back, good timing and a stimulating collaborative environment made my transition easier. In the meantime, moving into a new field with limited background requires patience, courage, and a willingness to sacrifice, e.g., having lower productivity in the first few years, and, more importantly, identifying a niche.

18.3 A few lessons learned

Importance of being a scientist besides a statistician: While working on statistical genetics in the last few years, an important message I appreciate more and more is to be a scientist first and then a statistician. To make a quantitative impact in the genetic field, one needs to be sincerely interested in science, devote serious time to learning genetics well enough to identify important problems, and collaborate closely with subject-matter scientists. It is less good practice to develop methods first and then look for applications in genetics to illustrate the methods. By doing so, it would be more challenging to make such methods have an impact in real-world practice, and one is more likely to follow the crowd and work on a problem at a later and more mature stage of the area. Furthermore, attractive statistical methods that are likely to become popular and advance scientific discovery need to integrate genetic knowledge well in method development. This requires a very good knowledge of genetics, identifying cutting-edge scientific problems that require new method development, and developing a good sense of important and less important problems.

Furthermore, the genetic field is more technology-driven than many other health science areas, and technology moves very fast. Statistical methods that were developed for data generated by an older technology might not be applicable to data generated by a new technology. For example, normalization methods that work well for array-based technology might not work well for sequencing-based technology. Statistical geneticists hence need to follow technological advances closely.

Simple and computationally efficient methods carry an important role: To make the analysis of massive genetic data feasible, computationally efficient and simple methods that can be easily explained to practitioners are often more advantageous and desirable. An interesting phenomenon is that simple classical methods seem to work well in practice. For example, in GWAS, simple single-SNP analysis has been commonly used in both the discovery phase and the validation phase, and has led to the discovery of hundreds of SNPs that are associated with disease phenotypes.


Importance of developing user-friendly open-access software: As the genetic field moves fast, especially with the rapid advance of biotechnology, timely development of user-friendly and computationally efficient open-access software is critical for a new method to become popular and used by practitioners. The software is more likely to be used in practice if it allows data to be input using the standard formats of genetic data. For example, for GWAS data, it is useful to allow data input in the popular PLINK format. Furthermore, a research team is likely to make more impact if a team member has a strong background in software engineering and facilitates software development.

Importance of publishing methodological papers in genetic journals: To increase the chance for a new statistical method to be used in the genetic community, it is important to publish the method in leading genetic journals, such as Nature Genetics, the American Journal of Human Genetics, and PLoS Genetics, and to make the paper readable to this audience. Thus strong communication skills are needed. Compared to statistical journals, genetic journals not only have a faster review time, but also, and more importantly, have a readership that is more likely to be interested in these methods and to be their immediate users. By publishing methodological papers in genetic journals, the work is more likely to have an impact in real-world practice and to speed up scientific discovery; also, as a pleasant byproduct, a paper gets more citations. It is a tricky balance in terms of publishing methodological papers in genetic journals or statistical journals.

18.4 A few emerging areas in statistical genetics and genomics

To demonstrate the excitement of the field and attract more young researchers to work in this field, I describe in this section a few emerging areas in statistical genetics that require advanced statistical method development.

18.4.1 Analysis of rare variants in next generation sequencing association studies

GWAS has been successful in identifying common susceptibility variants associated with complex diseases and traits. However, it has been found that disease-associated common variants only explain a small fraction of heritability (Manolio et al., 2009). Taking lung cancer as an example, the relative risks of the genetic variants found to be associated with lung cancer in GWAS (Hung et al., 2008) are much smaller than those from traditional epidemiological or environmental risk factors, such as cigarette smoking, radon, and asbestos exposures.


Different from the "common disease, common variants" hypothesis behind GWAS, the hypothesis of "common disease, multiple rare variants" has been proposed (Dickson et al., 2010; Robinson, 2010) as a complementary approach to search for the missing heritability.

The recent development of Next Generation Sequencing (NGS) technologies provides an exciting opportunity to improve our understanding of complex diseases, their prevention, and treatment. As shown by the 1000 Genomes Project (The 1000 Genomes Project Consortium, 2010) and the NHLBI Exome Sequencing Project (ESP) (Tennessen et al., 2012), a vast majority of variants on the human genome are rare variants. Numerous candidate gene, whole exome, and whole genome sequencing studies are being conducted to identify disease-susceptibility rare variants. However, the analysis of rare variants in sequencing association studies presents substantial challenges (Bansal et al., 2010; Kiezun et al., 2012) due to the presence of a large number of rare variants. Individual-SNP-based analysis commonly used in GWAS has little power to detect the effects of rare variants. SNP set analysis has been advocated to improve power by assessing the effects of a group of SNPs in a set, e.g., using a gene, a region, or a pathway. Several rare variant association tests have been proposed recently, including burden tests (Morgenthaler and Thilly, 2007; Li and Leal, 2008; Madsen and Browning, 2009) and non-burden tests (Lin and Tang, 2011; Neale et al., 2011; Wu et al., 2011; Lee et al., 2012). A common theme of these methods is to aggregate individual variants or individual test statistics within a SNP set. However, these tests suffer from power loss when a SNP set has a large number of null variants. For example, a large gene often has a large number of rare variants, many of which are likely to be null. Aggregating individual variant test statistics is likely to introduce a large amount of noise when the number of causal variants is small.

To formulate the problem in a statistical framework, assume n subjects are sequenced in a region, e.g., a gene, with p variants. For the ith subject, let Y_i be a phenotype (outcome variable), G_i = (G_{i1}, ..., G_{ip})^⊤ the genotypes of the p variants (G_{ij} = 0, 1, 2 for 0, 1, or 2 copies of the minor allele) in a SNP set, e.g., a gene/region, and X_i = (x_{i1}, ..., x_{iq})^⊤ a covariate vector. Assume the Y_i are independent and follow a distribution in the exponential family with E(Y_i) = μ_i and var(Y_i) = φ v(μ_i), where v is a variance function. We model the effects of the p SNPs G_i in a set, e.g., a gene, and the covariates X_i on a continuous/categorical phenotype using the generalized linear model (GLM) (McCullagh and Nelder, 1989),

g(μ_i) = X_i^⊤ α + G_i^⊤ β,    (18.1)

where g is a monotone link function, and α = (α_1, ..., α_q)^⊤ and β = (β_1, ..., β_p)^⊤ are vectors of regression coefficients for the covariates and the genetic variants, respectively. The n × p design matrix G = (G_1, ..., G_n)^⊤ is very sparse, with each column containing only a very small number of 1's or 2's and the rest being 0's.


The association between a region consisting of the p rare variants G_i and the phenotype Y can be tested by evaluating the null hypothesis H_0: β = (β_1, ..., β_p)^⊤ = 0. As the genotype matrix G is very sparse and p might be moderate or large, estimation of β is difficult. Hence the standard p-DF Wald and LR tests are difficult to carry out and also lose power when p is large. Further, if the alternative hypothesis is sparse, i.e., only a small fraction of the β's are non-zero but one does not know which ones, the classical tests do not effectively take the knowledge of the sparse alternative and the sparse design matrix into account.
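As one illustration of the simplest aggregation strategy mentioned above, the sketch below implements a weighted burden-style test under model (18.1) with a logistic link: rare variants in a region are collapsed into a single weighted genotype score, whose coefficient is then tested in a covariate-adjusted logistic regression. This is a simplified illustration on simulated placeholder data, not any of the specific published procedures cited; the weighting scheme is one common but not universal choice.

```python
# Minimal sketch of a weighted burden-style region test (illustrative only):
# collapse rare variants into one weighted score and test it with 1 DF.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n, p = 1000, 30
maf = rng.uniform(0.001, 0.02, size=p)            # rare minor allele frequencies
G = rng.binomial(2, maf, size=(n, p))             # sparse genotype matrix (0/1/2)
X = rng.normal(size=(n, 2))                       # covariates
y = rng.binomial(1, 0.3, size=n)                  # placeholder binary phenotype

# Up-weight the rarest variants (a common, though not universal, choice);
# in practice the MAFs would be estimated from the data.
weights = 1.0 / np.sqrt(maf * (1.0 - maf))
burden = G @ weights                              # one aggregated score per subject

design = sm.add_constant(np.column_stack([burden, X]))
fit = sm.Logit(y, design).fit(disp=0)
print("burden test p-value:", fit.pvalues[1])     # 1-DF test of the region
```

Non-burden tests such as those cited above instead combine per-variant score statistics (e.g., through a variance-component or kernel statistic), which protects power when many variants in the region are null or have effects in opposite directions.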


18.4.2 Building risk prediction models using whole genome data

Accurate and individualized prediction of risk and treatment response plays a central role in successful disease prevention and treatment. GWAS and genome-wide Next Generation Sequencing (NGS) studies present rich opportunities to develop risk prediction models using massive common and rare genetic variants across the genome together with well-known risk factors. These massive genetic data hold great potential for population risk prediction, as well as for improving prediction of clinical outcomes and advancing personalized medicine tailored to individual patients. It is a very challenging statistical task to develop a reliable and reproducible risk prediction model using millions or billions of common and rare variants, as a vast majority of these variants are likely to be null variants and the signals of individual variants are often weak.

The simple strategy of building risk prediction models using only the variants that are significantly associated with diseases and traits after scanning the genome misses a substantial amount of information. For breast cancer, GWASs have identified over 32 SNPs that are associated with breast cancer risk. Although a risk model based on these markers alone can discriminate cases and controls better than risk models incorporating only non-genetic factors (Hüsing et al., 2012), the genetic risk model still falls short of what should be possible if all the genetic variants driving the observed familial aggregation of breast cancer were known: the AUC is .58 (Hüsing et al., 2012) versus the expected maximum of .89 (Wray et al., 2010). Early efforts at including a large number of non-significant variants from GWAS in estimating heritability show encouraging promise (Yang et al., 2010).

The recent advances in NGS hold great promise for overcoming such difficulties. The missing heritability could potentially be uncovered by rare and uncommon genetic variants that are missed by GWAS (Cirulli and Goldstein, 2010). However, building risk prediction models using NGS data presents substantial challenges. First, there are a massive number of rare variants across the genome. Second, as variants are rare and the data dimension is large, their effects are difficult to estimate using standard MLEs. It is of substantial interest to develop statistical methods for risk prediction using massive NGS data.
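As a hedged illustration of one baseline strategy (not a method proposed in this chapter), the sketch below fits all variants jointly with a ridge-penalized logistic regression and evaluates discrimination (AUC) on held-out subjects. The data, penalty, and tuning choices are placeholders; real applications involve far more variants, careful tuning, and external validation.

```python
# Minimal sketch of joint, penalized risk prediction with many weak signals
# (illustrative only; simulated placeholder data).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)
n, p = 2000, 3000                                  # subjects, variants
maf = rng.uniform(0.005, 0.5, size=p)
G = rng.binomial(2, maf, size=(n, p)).astype(float)
beta = np.zeros(p)
beta[rng.choice(p, 50, replace=False)] = rng.normal(0, 0.15, 50)  # a few weak signals
prob = 1 / (1 + np.exp(-((G - 2 * maf) @ beta)))
y = rng.binomial(1, prob)

G_tr, G_te, y_tr, y_te = train_test_split(G, y, test_size=0.3, random_state=0)
model = LogisticRegression(penalty="l2", C=0.01, solver="lbfgs", max_iter=1000)
model.fit(G_tr, y_tr)                              # shrink all effects toward zero
auc = roc_auc_score(y_te, model.predict_proba(G_te)[:, 1])
print("held-out AUC:", round(auc, 3))
```

The shrinkage here is one way of borrowing strength across many weak, mostly null effects; how best to do this with rare variants from sequencing data is exactly the kind of open problem described above.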


18.4.3 Integration of different 'omics data and mediation analysis

An important emerging problem in genetic and genomic research is how to integrate different types of genetic and genomic data, such as SNPs, gene expression, and DNA methylation data, to improve understanding of disease susceptibility. The statistical problem of jointly modeling different types of genetic and genomic data and their relationships with a disease can be described using a causal diagram (Pearl, 2001) and framed using a causal mediation model (VanderWeele and Vansteelandt, 2010) based on counterfactuals. For example, one can jointly model SNPs, gene expression, and a disease outcome using the causal diagram in Figure 18.1, with gene expression serving as a potential mediator.

FIGURE 18.1
Causal mediation diagram: S is a SNP; G is a mediator, e.g., gene expression; Y is an outcome; and X is a vector of covariates.

To formulate the problem, assume for subject i ∈ {1, ..., n} that an outcome of interest Y_i is dichotomous (e.g., case/control), and that its mean is associated with q covariates (X_i), p SNPs (S_i), the mRNA expression of a gene (G_i), and possibly interactions between the SNPs and the gene expression as

logit{Pr(Y_i = 1 | S_i, G_i, X_i)} = X_i^⊤ β_X + S_i^⊤ β_S + G_i β_G + S_i^⊤ G_i β_GS,    (18.2)

where β_X, β_S, β_G, and β_GS are the regression coefficients for the covariates, the SNPs, the gene expression, and the interactions of the SNPs and the gene expression, respectively. The gene expression G_i (i.e., the mediator) depends on the q covariates (X_i) and the p SNPs (S_i) through a linear model,

G_i = X_i^⊤ α_X + S_i^⊤ α_S + ε_i,    (18.3)

where α_X and α_S are the regression coefficients for the covariates and the SNPs, respectively. Here, ε_i follows a Normal distribution with mean 0 and variance σ_G^2.

The total effect (TE) of the SNPs on the disease outcome Y can be decomposed into the Direct Effect (DE) and the Indirect Effect (IE). The Direct Effect of the SNPs is the effect of the SNPs on the disease outcome that is not through gene expression, whereas the Indirect Effect of the SNPs is the effect of the SNPs on the disease outcome that is through the gene expression. Under no-unmeasured-confounding assumptions (VanderWeele and Vansteelandt, 2010), the TE, DE, and IE can be estimated from the joint causal mediation models (18.2)–(18.3). In genome-wide genetic and genomic studies, the numbers of SNPs (S) and gene expressions (G) are both large. It is of interest to develop mediation analysis methods in such settings.
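To show the two model stages side by side, the sketch below fits (18.3) and a no-interaction version of (18.2) for a single SNP on simulated placeholder data. Under a rare outcome and no SNP-by-expression interaction, the SNP coefficient in the outcome model roughly corresponds to the direct effect on the log-odds scale, and the product of the SNP-to-expression and expression-to-outcome coefficients roughly corresponds to the indirect effect; a full analysis would use the counterfactual estimands and inference of the cited literature. All names and data are illustrative.

```python
# Minimal sketch of the two regression stages of a mediation analysis
# (single SNP, no interaction; simulated placeholder data).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 2000
X = rng.normal(size=(n, 2))                               # covariates
s = rng.binomial(2, 0.25, size=n)                         # SNP genotype (0/1/2)
G = 0.5 * s + X @ np.array([0.3, -0.2]) + rng.normal(size=n)   # expression (mediator)
eta = -3.0 + 0.2 * s + 0.4 * G + X @ np.array([0.1, 0.1])
y = rng.binomial(1, 1 / (1 + np.exp(-eta)))               # fairly rare binary outcome

# Mediator model (18.3): regress expression on the SNP and covariates.
med_fit = sm.OLS(G, sm.add_constant(np.column_stack([s, X]))).fit()
alpha_S = med_fit.params[1]

# Outcome model (18.2) without interaction: logistic regression of Y.
out_fit = sm.Logit(y, sm.add_constant(np.column_stack([s, G, X]))).fit(disp=0)
beta_S, beta_G = out_fit.params[1], out_fit.params[2]

print("approx. direct effect (log-odds):  ", round(beta_S, 3))
print("approx. indirect effect (log-odds):", round(alpha_S * beta_G, 3))
```

Scaling this kind of analysis to genome-wide sets of SNPs and expression probes, with proper confounding control and multiplicity adjustment, is the open problem described above.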


18.5 Training the next generation of statistical genetic and genomic scientists in the 'omics era

To help expedite scientific discovery in the 'omics era and respond to the pressing quantitative needs for handling massive 'omics data, there is a significant need to train a new generation of quantitative genomic scientists through an integrative approach designed to meet the challenges of today's biomedical science. Traditional biostatistical training does not meet this need. We need to train a cadre of interdisciplinary biostatisticians with strong quantitative skills and biological knowledge to work at the interface of biostatistics, computational biology, molecular biology, and population and clinical science genomics. They will be poised to become quantitative leaders in integrative and team approaches to genetic research in the public health and medical arenas. Trainees are expected to (1) have strong statistical and computational skills for the development of statistical and computational methods for massive 'omics data and for the integration of large genomic data from different sources; (2) have sufficient biological knowledge and understanding of both basic science and population science; (3) work effectively in an interdisciplinary research environment to conduct translational research from basic science to population and clinical sciences; and (4) play a quantitative leadership role in contributing to frontier scientific discovery and have strong communication skills to be able to engage in active discussions of the substance of biomedical research.

Recent advances in genomic research have shown that such integrated training is critical. First, biological systems are complex. It is crucial to understand how the biological systems work. Such knowledge facilitates result interpretation and new scientific discovery in population sciences and clinical sciences. Computational biology plays a pivotal role in understanding the complexity of the biological system and in integrating different sources of genomic data. Statistical methods provide systematic and rigorous tools for analyzing complex biological data and allow for statistical inference accounting for randomness in data. Second, many complex diseases are likely to be governed by the interplay of genes and environment. The 2003 IOM Committee on "Assuring the Public's Health" argued that advances in health will require a population health perspective that integrates understanding of biological and mechanistic science, human behavior, and social determinants of health. Analysis of GWAS and whole genome sequencing (WGS) data requires development of advanced biostatistical, computational, and epidemiological methods for big data. The top SNPs identified from GWAS and WGS scans often have unknown functions. Interpretation of these findings requires bioinformatics tools and data integration, e.g., connecting SNP data with gene expression or RNA-seq data (eQTL data). Furthermore, to increase analysis power, integration with other genomic information, such as pathways and networks, in statistical analysis is important.

Groundbreaking research and discovery in the life sciences in the 21st century are more interdisciplinary than ever, and students studying within the life sciences today can expect to work with a wider range of scientists and scholars than their predecessors could ever have imagined. One needs to recognize this approach to scientific advancement when training the next generation of quantitative health science students. Rigorous training in core statistical theory and methods remains important. In addition, students must have a broad spectrum of quantitative knowledge and skills, especially in the areas of statistical methods for analyzing big data, such as statistical and machine learning methods, together with more training in efficient computational methods for large data, programming, and information sciences. Indeed, analysis of massive genomic data requires much stronger computing skills than what is traditionally offered in biostatistics programs. Besides R, students would be well advised to learn other programming and scripting languages, such as Python and Perl.

The next generation of statistical genetic and genomic scientists should use rigorous statistical methods to analyze data, interpret results, harness the power of computational biology to inform scientific hypotheses, and work effectively as leading quantitative scientists with subject-matter scientists engaged in genetic research in basic science, population science, and clinical science. To train them, we need to develop an interdisciplinary curriculum, foster interactive research experiences in laboratory rotations ranging from wet labs in the biological sciences to dry labs (statistical genetics, computational biology, and genetic epidemiology), and develop leadership and communication skills through seminars, workshops, and projects, to enable our trainees to meet modern challenges in conducting translational genomic research.


18.6 Concluding remarks

We are living in an exciting time for genetic and genomic science, in which massive 'omics data present statisticians with many opportunities and challenges. To take full advantage of the opportunities and meet the challenges, we need to strategically broaden our roles and our quantitative and scientific knowledge, so that we can play a quantitative leadership role as statistical genetic and genomic scientists in both method development and scientific discovery. It is important to develop new strategies to train our students along these lines so they can succeed in the increasingly interdisciplinary research environment with massive data. With the joint effort of our community, we can best position ourselves and the younger generation as quantitative leaders for new scientific discovery in the 'omics era.

References

Bansal, V., Libiger, O., Torkamani, A., and Schork, N.J. (2010). Statistical analysis strategies for association studies involving rare variants. Nature Reviews Genetics, 11:773–785.

Breslow, N.E. and Clayton, D.G. (1993). Approximate inference in generalized linear mixed models. Journal of the American Statistical Association, 88:9–25.

Cirulli, E.T. and Goldstein, D.B. (2010). Uncovering the roles of rare variants in common disease through whole-genome sequencing. Nature Reviews Genetics, 11:415–425.

Dickson, S.P., Wang, K., Krantz, I., Hakonarson, H., and Goldstein, D.B. (2010). Rare variants create synthetic genome-wide associations. PLoS Biology, 8:e1000294, doi:10.1371/journal.pbio.1000294.

Hung, R.J., McKay, J.D., Gaborieau, V., Boffetta, P., Hashibe, M., Zaridze, D., Mukeria, A., Szeszenia-Dabrowska, N., Lissowska, J., Rudnai, P., Fabianova, E., Mates, D., Bencko, V., Foretova, L., Janout, V., Chen, C., Goodman, G., Field, J.K., Liloglou, T., Xinarianos, G., Cassidy, A. et al. (2008). A susceptibility locus for lung cancer maps to nicotinic acetylcholine receptor subunit genes on 15q25. Nature, 452:633–637.


Hüsing, A., Canzian, F., Beckmann, L., Garcia-Closas, M., Diver, W.R., Thun, M.J., Berg, C.D., Hoover, R.N., Ziegler, R.G., Figueroa, J.D. et al. (2012). Prediction of breast cancer risk by genetic risk factors, overall and by hormone receptor status. Journal of Medical Genetics, 49:601–608.

Kiezun, A., Garimella, K., Do, R., Stitziel, N.O., Neale, B.M., McLaren, P.J., Gupta, N., Sklar, P., Sullivan, P.F., Moran, J.L., Hultman, C.M., Lichtenstein, P., Magnusson, P., Lehner, T., Shugart, Y.Y., Price, A.L., de Bakker, P.I.W., Purcell, S.M., and Sunyaev, S.R. (2012). Exome sequencing and the genetic basis of complex traits. Nature Genetics, 44:623–630.

Lee, S., Wu, M.C., and Lin, X. (2012). Optimal tests for rare variant effects in sequencing association studies. Biostatistics, 13:762–775.

Li, B. and Leal, S.M. (2008). Methods for detecting associations with rare variants for common diseases: Application to analysis of sequence data. The American Journal of Human Genetics, 83:311–321.

Lin, D.Y. and Tang, Z.Z. (2011). A general framework for detecting disease associations with rare variants in sequencing studies. The American Journal of Human Genetics, 89:354–367.

Madsen, B.E. and Browning, S.R. (2009). A groupwise association test for rare mutations using a weighted sum statistic. PLoS Genetics, 5:e1000384.

Manolio, T.A., Collins, F.S., Cox, N.J., Goldstein, D.B., Hindorff, L.A., Hunter, D.J., McCarthy, M.I., Ramos, E.M., Cardon, L.R., Chakravarti, A., Cho, J.H., Guttmacher, A.E., Kong, A., Kruglyak, L., Mardis, E., Rotimi, C.N., Slatkin, M., Valle, D., Whittemore, A.S., Boehnke, M., Clark, A.G., Eichler, E.E., Gibson, G., Haines, J.L., Mackay, T.F.C., McCarroll, S.A., and Visscher, P.M. (2009). Finding the missing heritability of complex diseases. Nature, 461:747–753.

McCullagh, P. and Nelder, J.A. (1989). Generalized Linear Models. Chapman & Hall, London.

McPherson, J.D. (2009). Next-generation gap. Nature Methods, 6:S2–S5.

Morgenthaler, S. and Thilly, W.G. (2007). A strategy to discover genes that carry multi-allelic or mono-allelic risk for common diseases: A cohort allelic sums test (CAST). Mutation Research/Fundamental and Molecular Mechanisms of Mutagenesis, 615:28–56.

Neale, B.M., Rivas, M.A., Voight, B.F., Altshuler, D., Devlin, B., Orho-Melander, M., Kathiresan, S., Purcell, S.M., Roeder, K., and Daly, M.J. (2011). Testing for an unusual distribution of rare variants. PLoS Genetics, 7:e1001322, doi:10.1371/journal.pgen.1001322.


Pearl, J. (2001). Direct and indirect effects. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann Publishers, Burlington, MA, pp. 411–420.

Robinson, R. (2010). Common disease, multiple rare (and distant) variants. PLoS Biology, 8:e1000293, doi:10.1371/journal.pbio.1000293.

Tennessen, J.A., Bigham, A.W., O'Connor, T.D., Fu, W., Kenny, E.E., Gravel, S., McGee, S., Do, R., Liu, X., Jun, G., Kang, H.M., Jordan, D., Leal, S.M., Gabriel, S., Rieder, M.J., Abecasis, G., Altshuler, D., Nickerson, D.A., Boerwinkle, E., Sunyaev, S., Bustamante, C.D., Bamshad, M.J., Akey, J.M., Broad GO, Seattle GO, and the Exome Sequencing Project (2012). Evolution and functional impact of rare coding variation from deep sequencing of human exomes. Science, 337:64–69.

The 1000 Genomes Project Consortium (2010). A map of human genome variation from population-scale sequencing. Nature, 467:1061–1073.

VanderWeele, T.J. and Vansteelandt, S. (2010). Odds ratios for mediation analysis for a dichotomous outcome. American Journal of Epidemiology, 172:1339–1348.

Wray, N.R., Yang, J., Goddard, M.E., and Visscher, P.M. (2010). The genetic interpretation of area under the ROC curve in genomic profiling. PLoS Genetics, 6:e1000864.

Wu, M.C., Lee, S., Cai, T., Li, Y., Boehnke, M., and Lin, X. (2011). Rare-variant association testing for sequencing data with the sequence kernel association test. The American Journal of Human Genetics, 89:82–93.

Yang, J., Benyamin, B., McEvoy, B.P., Gordon, S., Henders, A.K., Nyholt, D.R., Madden, P.A., Heath, A.C., Martin, N.G., Montgomery, G.W., Goddard, M.E., and Visscher, P.M. (2010). Common SNPs explain a large proportion of the heritability for human height. Nature Genetics, 42:565–569.


19
Reflections on women in statistics in Canada

Mary E. Thompson
Department of Statistics and Actuarial Science
University of Waterloo, Waterloo, ON

Having been a recipient of the Elizabeth L. Scott Award, I have chosen the subject of women in statistics in Canada. The chapter is a selective history, combined with a series of reflections, and is intended as a tribute to women in statistics in Canada — past, present, and future.

Elizabeth L. Scott touches the story herself in a couple of ways, though fleetingly. I am fortunate to have met her in Delhi in December 1977, at the 41st Session of the International Statistical Institute. I had studied some of her work in modeling and inference — and had recently become aware of her pioneering approaches to the study of gender disparities in faculty salaries.

I have written about some individuals by name, mainly senior people. Because there are many women in the field in Canada, I have mentioned few whose last degree is more recent than 1995. Even so, I am conscious of omissions, and hope that this essay may inspire others to tell the story more completely.

19.1 A glimpse of the hidden past

The early history of statistics in Canada is not unrecorded, but apart from a few highlights, it has never been processed into a narrative. Quite possibly women were involved in statistics from the beginning. Before the Statistical Society of Canada brought statisticians together and began to record their contributions, there were no chronicles of the profession. There may be other stories like the following, told by historian of statistics David Bellhouse:

"Last summer on a family holiday in Winnipeg, we had dinner with my father-in-law at his seniors' apartment building in the downtown area. Seated at our table in the dining room were four sisters.


Since mealtime conversations at any dinner party I have ever attended rarely turn to the topic of statistics, I was surprised by the turn of events. For some reason that I cannot recall, one of the sisters, a diminutive and exuberant nonagenarian, stated that she was a statistician. After asking where she had worked and when, and where, when, and with whom she had studied, I came to the conclusion that I was talking not only to the oldest surviving statistician in the province of Manitoba, but also to one of the first women, if not the first, to work professionally as a statistician in Canada." (Bellhouse, 2002)

The statistician was Isobel Loutit, who was born in Selkirk, Manitoba, in July of 1909. She obtained a BA in mathematics with a minor in French from the University of Manitoba in 1929. She was taught statistics by Professor Lloyd Warren, using textbooks by Gavett and Yule. She started out as a teacher — in those days, school teaching, nursing, and secretarial work were virtually the only career paths open to women on graduation — but with World War II, as positions for women in industry opened up, she began a career in statistical quality control. In 1969, at a time when it was still rare for women to hold leadership positions, Loutit became chair of the Montréal Section of the American Society for Quality Control (ASQC).

She was made an Honorary Member of the Statistical Society of Canada in 2009, and the Business and Industrial Statistics Section instituted the Isobel Loutit Invited Address in her honor. She died in April 2009 at the age of 99.

19.2 Early historical context

Although the first census in North America was conducted in New France by Intendant Jean Talon in 1665–66, data gathering in Canada more generally seems to have become serious business only in the 19th century. Canadians of the time were avidly interested in science. Zeller (1996) credits Alexander von Humboldt's encouragement of worldwide studies of natural phenomena and the culture of the Scottish Enlightenment with leading to the Victorian "tradition of collecting 'statistics'... using detailed surveys to assess resources and quality of life in various districts." In Canada, such surveys of agricultural potential began as early as 1801 in Nova Scotia. Natural history societies were founded in the 1820s and later, while scientific organizations began to be formed in the 1840s. Statistics was used in analyses related to public health in Montréal as early as the 1860s (Bellhouse and Genest, 2003), a few years after the pioneering work of Florence Nightingale and John Snow in the 1850s.

In both Britain and North America there were substantial increases in opportunity for the education of women in the late 19th and early 20th centuries, and women began to enter the scientific professions. At the same time, statistics as a mathematical discipline came into being, through the efforts of Karl Pearson in England.


Always on the lookout for references to the practice of statistics in Canada, Dr. Bellhouse has recently come across a 1915 letter in the files of Karl Pearson from one of his assistants in the field of craniometry, New Zealander Kathleen Ryley, who had by 1915 emigrated to Canada. We find that she had first visited Canada on a trip to Winnipeg in August 1909, among many women from abroad attending the meetings of the British Association for the Advancement of Science at the University of Manitoba. When she moved to Canada, she undertook the running of the Princess Patricia Ranch in Vernon, British Columbia — a fascinating story in itself (Yarmie, 2003). Pearson's biographer notes that the women who studied or worked with Karl Pearson in the early days generally had trouble finding positions in which they could continue statistical work (Porter, 2004).

Official statistics came into its own in Québec with the founding in 1912 of the Bureau de la statistique du Québec (Beaud and Prévost, 2000), and the Dominion Bureau of Statistics (now Statistics Canada) came about through the Statistics Act in 1918; for a detailed account, see Worton (1998). Data from censuses conducted at least every ten years since 1851 can now be studied to provide a picture of the evolution of the status of women in the country, particularly following digitization of samples from the census beginning with 1911; see Thomas (2010).

Statistics was slow to enter the academy in North America, in both the United States and Canada (Huntington, 1919). According to Watts (1984), statistics courses were taught at the University of Manitoba beginning in 1917 by Lloyd A.H. Warren — Isobel Loutit's professor — who was developing curricula for actuarial science and commerce (Rankin, 2011). Watts cites a 1918 article by E.H. Godfrey as saying that no Canadian university was then teaching statistics as "a separate branch of science." However, there were at that time in the École des hautes études commerciales in Montréal (now HEC Montréal) a Chair of Statistics and "a practical and comprehensive curriculum"; at the University of Toronto, statistics was a subject in the second year of the course in Commerce and Finance.

The first teachers of statistics as a subject in its own right in mathematics departments included E.S. Keeping in Alberta and George L. Edgett at Queen's, who taught a statistics course in 1933. Daniel B. DeLury, who taught my first course in statistics in 1962 at the University of Toronto, had begun teaching at the University of Saskatchewan in the period 1932–35. Some statistics was taught by biologists, particularly those involved in genetics. Geneticist and plant breeder Cyril H. Goulden is said to have written the first North American textbook in biostatistics in 1937, for the students he was teaching at the University of Manitoba. Yet as late as 1939, Dominion Statistician Robert H. Coats lamented that "five of our twenty-two universities do not know the word in their curricula" (Coats, 1939).


In 1950, the University of Manitoba was the first to recognize statistics in the name of a department (the Department of Actuarial Mathematics and Statistics). The first two Departments of Statistics were formed at the University of Waterloo and the University of Manitoba in 1967. There are now four other separate statistics or statistics and actuarial science departments, at the University of Toronto, Western University in Ontario, the University of British Columbia, and Simon Fraser University in Burnaby, British Columbia.

19.3 A collection of firsts for women

When I was a university student in the years 1961–65, there were few women professors in departments of mathematics in Canada. One of the exceptions was Constance van Eeden, who had come in the early 1960s (Clarke, 2003). Dr. van Eeden was born in 1927 in the Netherlands. She finished high school in 1944, and when the Second World War was over, she attended university, graduating in 1949 with a first degree in mathematics, physics, and astronomy. She then entered a new actuarial program for her second degree, and in 1954 began as a part-time assistant at the Statistics Department at the Math Research Center in Amsterdam. She received her PhD cum laude in 1958. After some time at Michigan State and some time in Minneapolis, she and her husband (Charles Kraft, also a statistician) moved to the Université de Montréal — where, in contrast to the situation at their previous two institutions, there was no regulation against both members of a couple having tenure in the same department.

A specialist in mathematical statistics and in particular estimation in restricted parameter spaces, van Eeden was the first woman to receive the Gold Medal of the Statistical Society of Canada, in 1990. She has an extensive scientific "family tree"; two of her women students went on to academic careers in Canada: Louise Dionne (Memorial University of Newfoundland) and Sorana Froda (Université du Québec à Montréal).

There are, or were, other women statisticians in Canada born in the 1920s, but most have had their careers outside academia. The 34th Session of the International Statistical Institute was held in Ottawa fifty years ago, in 1963 — the only time the biennial meeting of the ISI has been held in Canada. The Proceedings tell us the names of the attendees. (Elizabeth L. Scott, from the University of California, Berkeley, was one.) Of the 136 Canadian "guests," 14 were women. Their names are listed in Table 19.1.

TABLE 19.1
The 14 Canadian women who attended the 34th Session of the International Statistical Institute held in Ottawa, Ontario, in 1963.

Marjorie M. Baskerville, Dominion Foundries and Steel Ltd, Hamilton
M. Anne Corbet, Department of Secretary of State, Ottawa
I. June Forgie, Department of Transport, Ottawa
Geraldine E. Fulton, Sir George Williams University, Montréal
Irene E. Johnson, Department of Labour, Ottawa
Elma I. Kennedy, Department of Forestry, Ottawa
Pamela M. Morse, Department of Agriculture, Ottawa
Monique Mousseau, Radio-Canada, Montréal
Sylvia Ostry, Université de Montréal
Dorothy J. Powell, Bank of Nova Scotia, Toronto
Margaret R. Prentis, Department of Finance, Ottawa
Jean R. Proctor, Department of Agriculture, Ottawa
Joan Grace Sloman, Ontario Department of Health, Toronto
Dorothy Walters, National Energy Board, Ottawa

One of the two from universities in this list is the first and so far the only woman to hold the position of Chief Statistician of Canada: Winnipeg-born Sylvia Ostry, CC OM FRSC, an economist (PhD 1954) who was Chief Statistician from 1972 to 1975. In 1972, as she began her mandate, Dr. Ostry was the first woman working in Canada to be elected a Fellow of the American Statistical Association.


Among statistical researchers in Canada who were born in the 1930s and early 1940s, many (both men and women) have worked in the area of the design of experiments. One such is Agnes M. Herzberg, who was a student of Dr. Norman Shklov at the University of Saskatchewan in Saskatoon. Dr. Herzberg went to England on an Overseas Fellowship in 1966, soon after obtaining her PhD, and stayed as a member of the Department of Mathematics at Imperial College until 1988, when she moved back to Canada and a professorship at Queen's University. She was the first woman to serve as President of the Statistical Society of Canada, in 1991–92, and the first to be awarded the SSC Distinguished Service Award, in 1999. In recent years, in addition to her research, Dr. Herzberg has focused much of her energy on organizing a series of annual conferences on Statistics, Science and Public Policy, held at Herstmonceux Castle in England. These meetings are attended by a wide variety of participants, from science, public service, and the press (Lawless, 2012).

Canada is the adopted home of Priscilla E. (Cindy) Greenwood of the University of British Columbia (PhD 1963, University of Wisconsin, Madison), a distinguished probabilist who has also worked in mathematical statistics and efficient estimation in stochastic processes. Her work and love of science have been celebrated in a special Festschrift volume of Stochastics published in 2008 (vol. 80). In 1997, she was awarded a grant of $500,000 from the Peter Wall Institute of Advanced Studies for their first topic study: "Crisis Points and Models for Decision."


A rise in consciousness of the status of women in the late 1960s and early 1970s was marked by several initiatives in Canada; a landmark was the tabling of the report of the Royal Commission on the Status of Women in 1970, which recommended that gender-based discrimination in employment be prohibited across the country (as it had been in several provinces). Women began to be elected in greater numbers to positions in the learned societies in those years. In 1975, Audrey Duthie of the University of Regina and Kathleen (Subrahmaniam) Kocherlakota of the University of Manitoba were the first women to be elected to the Board of Directors of the Statistical Science Association of Canada, one of the ancestors of the Statistical Society of Canada (SSC); this story is chronicled in Bellhouse and Genest (1999). Gail Eyssen (now of the University of Toronto) was elected to the SSC Board in 1976. Since that year, there has always been at least one woman serving on the Board; beginning in about 1991, there have been several each year.

The second woman to become President of the SSC was Jane F. Gentleman, whom I first met when I came to Waterloo in 1969. She was a fine role model and supportive mentor. Dr. Gentleman moved to Statistics Canada in 1982, and in 1999 she became the Director of the Division of Health Interview Statistics at the National Center for Health Statistics in Maryland. She was the winner of the first annual Janet L. Norwood Award, in 2002, for Outstanding Achievement by a Woman in the Statistical Sciences.

I met K. Brenda MacGibbon-Taylor of the Université du Québec à Montréal in 1993, when I was appointed to the Statistical Sciences Grant Selection Committee by the Natural Sciences and Engineering Research Council of Canada (NSERC). Dr. MacGibbon was Chair of the Committee that year, the first woman to fill that position, and I came to have a great admiration for her expertise and judgment. She obtained her PhD at McGill in 1970, working on K-analytic spaces and countable operations in topology under the supervision of Donald A. Dawson. During her career, Dr. MacGibbon has worked in many areas of statistics, with a continuing interest in minimax estimation in restricted parameter spaces.

Another student of Dawson, probabilist Gail Ivanoff, was the first woman to fill the position of Group Chair of the Mathematical and Statistical Sciences Evaluation Group at NSERC, from 2009 to 2012. Dr. Ivanoff, who works on stochastic processes indexed by sets, as well as point processes and asymptotics, is a key member of the very strong group of probabilists and mathematical statisticians in the Ottawa–Carleton Institute for Mathematics and Statistics.

A third Dawson student, Colleen D. Cutler, was the first woman to be awarded the CRM–SSC Prize, given jointly by the SSC and the Centre de recherches mathématiques (Montréal) in recognition of accomplishments in research by a statistical scientist within 15 years of the PhD. The award, which she received in the third year of its bestowing, recognizes her work at the interface of non-linear dynamics and statistics, and in particular non-linear time series, the study of determinism in time series, and the computation of fractal dimension.


Professor Nancy M. Reid of the University of Toronto, another former President of the SSC, is the owner of "firsts" in several categories. Most notably, she was the first woman, and the first statistician working in Canada, to receive the COPSS Presidents' Award, in 1992. She is the only Canadian-based statistician to have served as President of the Institute of Mathematical Statistics (IMS), in 1997. In 1995 she was awarded the first Krieger–Nelson Prize of the Canadian Mathematical Society — a prize "in recognition of an outstanding woman in mathematics." She was also the first woman in Canada to be awarded a Canada Research Chair in the statistical sciences, and is the only woman to have served as Editor of The Canadian Journal of Statistics.

In this International Year of Statistics, for the first time ever, the Program Chair and the Local Arrangements Chair of the 41st SSC Annual Meeting are both women, respectively Debbie Dupuis of HEC Montréal and Rhonda Rosychuk of the University of Alberta. The meeting was held in Edmonton, Alberta, May 26–29, 2013.

19.4 Awards

I have always had conflicted feelings about awards which, like the Elizabeth L. Scott Award, are intended to recognize the contributions of women. Why do we have special awards for women? Is it to compensate for the fact that relatively few of us win prestigious research awards in open competition? Or is it rather to recognize that it takes courage to be in the field as a woman, and that those of us who are here now should be mindful of the sacrifices and difficulties faced by our forerunners?

In general, the awards in an academic or professional field recognize a few in the context of achievements by the many. Perhaps it is as well to remember that awards are really mainly pretexts for celebrations. I will never forget the toast of a student speaker at a Waterloo awards banquet: "Here's to those who have won awards, and to those who merely deserve them."

The SSC has two awards named after women, both open to men and women alike. Besides the Isobel Loutit Invited Address, there is the biennial Lise Manchester Award. Lise Manchester of Dalhousie University in Halifax was the first woman to receive The Canadian Journal of Statistics Award, in 1991, the second year of its bestowing, for a paper entitled "Techniques for comparing graphical methods." The community was saddened at the passing of this respected young researcher, teacher, and mother. The Lise Manchester Award, established in 2007, commemorates her "abiding interest in making use of statistical methods to provide insights into matters of relevance to society at large."


The other awards of the Statistical Society of Canada are also open to both men and women. One is the afore-mentioned The Canadian Journal of Statistics Award, for the best paper appearing in a volume of the journal. Of the 23 awards from 1990 to 2012, women have been authors or co-authors of 10; Nancy E. Heckman of the University of British Columbia has received it twice, the first time (with John Rice, 1997) for a paper on line transects of two-dimensional random fields, and the second time (with James O. Ramsay, 2000) for penalized regression with model-based penalties.

Women are well represented among winners of the CRM–SSC Prize since its creation in 1999 (Colleen Cutler, Charmaine Dean, Grace Y. Yi). To date, three of us have received the SSC Gold Medal (Constance van Eeden, Nancy Reid, and myself) since it was first awarded in 1985. In the first 27 awards of the Pierre Robillard Award for the best thesis in probability or statistics defended at a Canadian university in a given year, three of the winners were women (Maureen Tingley, Vera Huse-Eastwood, and Xiaoqiong Joan Hu), but since the year 2001 there have been eight: Grace Chiu, Rachel MacKay-Altman, Zeny Zhe-Qing Feng, Mylène Bédard, Juli Atherton, Jingjing Wu, Qian Zhou, and Bei Chen.

There are just eight women who have been elected Fellow of the Institute of Mathematical Statistics while working in Canada. The one not so far mentioned is Hélène Massam of York University, who combines mathematics and statistics in the study of hierarchical and graphical models. I count sixteen who have been elected Fellow of the American Statistical Association, including Thérèse Stukel of the Institute for Clinical Evaluative Sciences in Toronto (2007), Keumhee C. Chough of the University of Alberta (2009), Xiaoqiong Joan Hu of Simon Fraser University (2012), Sylvia Esterby of UBC Okanagan (2013), and W.Y. Wendy Lou of the University of Toronto (2013).

19.5 Builders

Since the late 1980s, women in Canada have begun to find themselves more welcome in leadership positions in academia and in the statistics profession. It is as though society suddenly realized at about that time that to consider only men for such roles was to miss out on a significant resource. In some cases, despite the ephemeral nature of "service" achievements, our leaders have left lasting legacies.

One builder in academia is Charmaine B. Dean, who came to Canada from San Fernando, Trinidad. She completed an Honours Bachelor's Degree in Mathematics at the University of Saskatchewan and her PhD at the University of Waterloo. She joined the Department of Mathematics and Statistics at Simon Fraser University in 1989, and several years later played a major role in setting up the Department of Statistics and Actuarial Science, becoming the founding Chair in 2001.


She received the CRM–SSC Prize in 2003 for her work on inference for over-dispersed generalized linear models, the analysis of recurrent event data, and spatial and spatio-temporal modelling for disease mapping. In 2002, Dr. Dean was President of WNAR, the Western North American Region of the Biometric Society; she was also President of the SSC in 2006–07 and is currently Dean of the Faculty of Science at Western University, London, Ontario.

Another example is Shelley Bull, a graduate of the University of Waterloo and Western University in Ontario. As a faculty member in biostatistics at the Samuel Lunenfeld Institute of Mount Sinai General Hospital, in Toronto, she became interested in research in statistical genetics in the 1990s. When MITACS (Mathematics of Information Technology for Complex Systems), a Network of Centres of Excellence, began in 1999, Dr. Bull became the leader of a national team pursuing research at the interface of statistics and genetics, in both modeling and analysis, with emphasis on diseases such as breast cancer and diabetes. The cohesiveness of the group of statistical genetics researchers in Canada owes much of its origin to her project.

I am proud to count myself among the builders, having chaired the Department of Statistics and Actuarial Science at the University of Waterloo from 1996 to 2000. Besides Charmaine Dean and myself, other women in statistics who have chaired departments include Nancy Reid (Toronto), Nancy Heckman (UBC), Karen Campbell (Epidemiology and Biostatistics, Western), Cyntha Struthers (St. Jerome's), Sylvia Esterby (UBC Okanagan), and Christiane Lemieux (Waterloo).

Nadia Ghazzali, formerly at Université Laval, has held since 2006 the NSERC Chair for Women in Science and Engineering (Québec Region). She is the first woman statistician to become President of a university in Canada. She was appointed Rector of the Université du Québec à Trois-Rivières in 2012.

Since the late 1990s, the SSC has had an active Committee on Women in Statistics, which sponsors events at the SSC annual meetings jointly with the Canadian Section of the Caucus for Women in Statistics. Cyntha Struthers was both the founding Chair of the Caucus section, in 1987–89, and Chair of the SSC Committee on Women in Statistics in 1998–2000.

For professional leadership, an early example was Nicole P.-Gendreau of the Bureau de la statistique du Québec (which has since become the Institut de la statistique du Québec). Mme Gendreau was Public Relations Officer of the Statistical Society of Canada from 1986 to 1989. She was the founder of the newsletter, SSC Liaison, the chief communication vehicle of the community in Canada, now in online and print versions, and still very much in keeping with her original vision.

The process of developing SSC Accreditation was brought to fruition in 2003–04 (when I was President of the SSC) under the dedicated leadership of Judy-Anne Chapman, now at Queen's University. (At least 30 of the 140 PStat holders to date are women.)


As Chair of the SSC Education Committee, Alison Gibbs of the University of Toronto led the SSC's takeover of responsibility for the highly successful educational program Census at School Canada. Shirley Mills of Carleton University is the first woman to lead the day-to-day operations of the SSC as Executive Director.

Few of us in the earlier cohorts have followed a career path in the private sector. A notable exception is Janet E.A. McDougall, PStat, the founder and President of McDougall Scientific, a leading clinical research organization which has been in operation since 1984.

19.6 Statistical practice

Statistics is a profession as well as an academic subject. Some of the best statisticians I know spend much of their time in statistical practice, and in research that arises from practical problems.

To name just two: Dr. Jeanette O'Hara Hines was Director of the Statistical Consulting Service (and teacher of consulting) at the University of Waterloo for many years, and specialized in working with faculty in the Department of Biology on a very diverse set of problems; Hélène Crépeau, Nancy Reid's first Master's student, who has been working at the Service de consultation statistique de l'Université Laval since 1985, has been involved in biometrics research and studies of the quantification of wildlife populations.

As in 1963, many women statisticians work in government agencies. In earlier days at Statistics Canada, Estelle Bee Dagum developed the X-11-ARIMA method, variants of which are used for seasonal adjustment of time series around the world. Survey statistician Georgia Roberts of Statistics Canada is an expert on methodology and the analytical uses of complex survey data, and has led the Data Analysis Resource Centre for several years. In the same area of data analysis is Pat Newcombe-Welch, the Statistics Canada analyst at the Southwestern Ontario Research Data Centre, who is on the front lines of assisting researchers of many disciplines in gaining access to Statistics Canada data. Susana Rubin-Bleuer, Ioana Schiopu-Kratina, and Lenka Mach of Statistics Canada have published research papers on advanced theoretical topics in the analysis of complex survey data and in sample coordination.

Women statisticians have also achieved leadership roles in statistical agencies. To name just a few: Louise Bourque was Directrice de la méthodologie for several years at the BSQ/ISQ; Nanjamma Chinnappa served as Director of the Business Survey Methods Division at Statistics Canada; Marie Brodeur is Director General of the Industry Statistics Branch of Statistics Canada; and Rosemary Bender is Assistant Chief Statistician of Analytical Studies and Methodology.


19.7 The current scene

Statistics is an important profession in Canada today. Despite a steady inflow of capable young people, the demand for talented statisticians outstrips the supply. It is fortunate that women are entering the field in greater numbers than ever before. The cohorts with PhDs from the 1990s are now mid-career and making their mark on the profession, while those just beginning are achieving things that I would never have dreamed of at the same stage. Are there still barriers that are particular to women? Is it still harder for women to enter the profession, and to rise high in its ranks?

The answer is probably "yes." When we think of it, some of the obstacles that were always there are hardly likely to disappear: the difficulties of combining career and family life; the "two-body" problem, despite the adoption by most universities of family-friendly "spousal hiring policies"; and the physical toll of long working hours and dedication. Other barriers might continue to become less pervasive over time, such as the prejudices that have contributed to the adoption of double-blind refereeing by many journals. The Canadian Journal of Statistics was among the first to adopt this policy, in 1990.

In academia, examples of anomalies in salary and advancement may be fewer these days. At least at the University of Waterloo, statistical approaches something like those pioneered by Elizabeth L. Scott are now taken to identify and rectify such anomalies. There are still, however, several departments across the country with a surprisingly small number of women faculty in statistics.

It used to be the case that in order to succeed in a field like statistics, a woman had to be resolute and determined, and be prepared to work very hard. It was often said that she would have to work twice as hard as a man to achieve the same degree of recognition. It now seems to be the case that both men and women entering the field have to be either consummately brilliant or resolute and determined, to about the same degree — but it is still an easier road for those who can afford to be single-minded. I remember the days when women who pursued a career, particularly in academia, were considered rather odd, and in some ways exceptional. We are now no longer so odd, but those who make the attempt, and those who succeed, are still exceptional, in one way or another.

Acknowledgments

I would like to thank David Bellhouse, Christian Genest, and the reviewers for very helpful suggestions.


References

Beaud, J.-P. and Prévost, J.-G. (2000). L'expérience statistique canadienne. In The Age of Numbers: Statistical Systems and National Traditions (J.-P. Beaud and J.-G. Prévost, Eds.). Presses de l'Université du Québec à Montréal, Montréal, Canada, pp. 61–86.

Bellhouse, D.R. (2002). Isobel Loutit: Statistician of quality. SSC Liaison, 16(2):14–19.

Bellhouse, D.R. and Genest, C. (1999). A history of the Statistical Society of Canada: The formative years (with discussion). Statistical Science, 14:80–125.

Bellhouse, D.R. and Genest, C. (2003). A public health controversy in 19th century Canada. Statistical Science, 20:178–192.

Clarke, B. (2003). A conversation with Constance van Eeden. SSC Liaison, 17(4):28–35.

Coats, R.H. (1939). Science and society. Journal of the American Statistical Association, 34:1–26.

Huntington, E.V. (1919). Mathematics and statistics, with an elementary account of the correlation coefficient and the correlation ratio. American Mathematical Monthly, 26:421–435.

Lawless, J.F. (2012). A conversation with Agnes Herzberg. SSC Liaison, 26(4):40–45.

Porter, T.M. (2004). Karl Pearson: The Scientific Life in a Statistical Age. Princeton University Press, Princeton, NJ, pp. 273–275.

Rankin, L. (2011). Assessing the Risk: A History of Actuarial Science at the University of Manitoba. Warren Centre for Actuarial Studies and Research, Asper School of Business, University of Manitoba, Winnipeg, MB.

Thomas, D. (2010). The Census and the evolution of gender roles in early 20th century Canada. Canadian Social Trends, Catalogue No 11-008, pp. 40–46. http://www.statcan.gc.ca/pub/11-008-x/2010001/article/11125-eng.pdf

Watts, D.G. (1984). Teaching statistics in Canada: The early days. The Canadian Journal of Statistics, 12:237–239.

Worton, D.A. (1998). Dominion Bureau of Statistics: A History of Canada's Central Statistical Office and its Antecedents, 1841–1972. McGill-Queen's University Press, Montréal, Canada.


Yarmie, A. (2003). "I had always wanted to farm." The quest for independence by English female emigrants at the Princess Patricia Ranch, Vernon, British Columbia: 1912–1920. British Journal of Canadian Studies, 16:102–125.

Zeller, S. (1996). Land of Promise, Promised Land: The Culture of Victorian Science in Canada. Canadian Historical Association Historical Booklet No 56.


20
"The whole women thing"

Nancy M. Reid
Department of Statistical Sciences
University of Toronto, Toronto, ON

It is an honor and a pleasure to contribute to this celebratory volume, and I am grateful to the editors for their efforts. The temptation to discuss non-technical aspects of our profession and discipline has caught me too, and I have, with some trepidation, decided to look at the past, present and future of statistical science through a gender-biased lens. In the past fifty years, a great deal has changed for the better, for the position of women in science and in statistical science, but I believe we still have some way to go.

20.1 Introduction

The title of this chapter is a quote, as I remember it, from a dear friend and colleague. The occasion was a short discussion we had while rushing in opposite directions to catch talks at a Joint Statistical Meeting, probably in the early 1990s, and he asked me if I might consider being nominated to run for election as President of the Institute of Mathematical Statistics (IMS). I was completely surprised by the question, and my immediate reactions were to be honored that we were discussing it, and to assume that the question was rhetorical. He said no, this was within the realm of possibility, and I should give it careful consideration, for all the reasons one might expect: an honor for me, a chance to influence an organization I cared about, etc. He ended by saying "plus, you know, there's the whole women thing. I guess you'd be the first." In fact, Elizabeth Scott was the first woman President of the IMS, in 1978.

For various reasons, a number of unconnected events recently got me thinking about "the whole women thing." Despite many years of on-and-off thinking about issues surrounding gender and a professional career, I find I still have a lot of questions and not many answers. I have no training in social science, nor in women's studies, nor in psychology, and no experience of what seems to me the difficult world outside academia.


I will present a handful of anecdotes, haphazardly chosen published studies, and personal musings. My vantage point is very North American, but I hope that some of the issues resonate with women in other countries as well.

20.2 "How many women are there in your department?"

In September 2012, the small part of the blogosphere that I sometimes wander through lit up with an unusual amount of angst. A study on gender bias (Moss-Racusin et al., 2012a,b) appeared in the Proceedings of the National Academy of Sciences (PNAS), not in itself so unusual, but this one got us rattled. By "us" I mean a handful of internet colleagues who worry about issues of women in science, at least some of the time. I was alerted to the article through Isabella Laba's blog (Laba, 2012a, 2013), but the story was also picked up by many of the major news organizations. The PNAS paper reported on a study in which faculty members in biology, chemistry, and physics departments were asked to evaluate an application for a position as a student laboratory manager, and the participants were told that this was part of a program to develop undergraduate mentoring. From Moss-Racusin et al. (2012b):

"Following conventions established in previous experimental work..., the laboratory manager application was designed to reflect slightly ambiguous competence, allowing for variability in participant responses... if the applicant had been described as irrefutably excellent, most participants would likely rank him or her highly, obscuring the variability in responses to most students for whom undeniable competence is frequently not evident."

In other words, the applicant would likely not be at the top of anyone's short list, but was qualified for the position. Faculty were asked to evaluate the application as if they were hiring the student into their own lab. The applications were all identical, but for half the scientists the student was named "John," and for the other half, "Jennifer."

The headline story was that scientists rated applications from a male student higher than those from a female student. Scores assigned to qualities of competence, hireability, and mentoring were systematically higher for the male student application, and

"The mean starting salary offered the female student, $26,507.94, was significantly lower than that of $30,238.10 to the male student [t = 3.42, P < 0.01]."


… (Moss-Racusin et al., 2012a), and the accompanying supplementary material (Moss-Racusin et al., 2012b). A particularly concerning result was that female scientists exhibited the same gender bias as their male counterparts.

These results reverberated because they felt real, and indeed I felt I recognized myself in this study. During my many years on search committees, and during my five-year tenure as department chair, efforts to hire female research-stream faculty were not successful. It took me a surprisingly long time to come to the conclusion that it was easy to decide to make an offer to the best female candidate in any given hiring season, but much, much harder for females in the next tier ‘down’ to be ranked as highly as the men in the same tier. Our intentions were good, but our biases not easily identified. With reference to the PNAS study, Laba says:

“The scientists were not actively seeking to discriminate... They offered similar salaries to candidates that they perceived as equally competent, suggesting that, in their minds, they were evaluating the candidate purely on merit. The problem is that the female candidate was judged to be less competent, evidently for no reason other than gender, given that the resumes were exactly identical except for the name. [···] I’m sure that most of the participants, believing themselves unbiased, would be shocked to see the results.” (Laba, 2012a)

I’ve presented this study in two talks, and mentioned it in a number of conversations. The reaction from women is often to note other related studies of gender bias; there are a number of these, with similar designs. An early study of refereeing (Goldberg, 1968) involved submitting identical articles for publication with the author’s name either Joan or John; this study featured in the report of an IMS committee to investigate double-blind refereeing; see Cox et al. (1993). A more common reaction is to speculate more broadly on whether or not women self-select out of certain career paths, are genuinely less interested in science, and so on. This deflects from the results of the study at hand, and also diffuses the discussion to such an extent that the complexity of “the women thing” can seem overwhelming. Here is Laba in a related post:

“Let’s recap what the study actually said: that given identical paperwork from two hypothetical job candidates, one male and one female, the woman was judged as less competent and offered a lower salary. This is not about whether girls, statistically speaking, are less interested in science. It’s about a specific candidate who had already met the prerequisites... and was received much better when his name was John instead of Jennifer.” (Laba, 2012b)

We all have our biases. The ABC News report (Little, 2012) on the paper described a “small, non-random experiment,” but the study was randomized, and the authors provided considerable detail on the size of the study and the response rate. The authors themselves have been criticized for their bar chart of the salary differential. By conveniently starting the y-axis at $25,000, the visual appearance suggests a three-fold salary differential.

That the biases are subtle, and often unconscious, is much better than what many of the pioneers faced. Elizabeth Scott’s first ambition was to be an astronomer, but she soon realized that this was a hopeless career for a woman (Billard and Ferber, 1991). We’ve heard the jaw-dropping stories of blatant sexism from days gone by, and to the extent that this is behind us, this is progress of a sort. But biases that we don’t even recognize should concern us all.

How to move forward? I hope this study will help: by highlighting biases that seem to be operating well below the surface, perhaps the next search committees will work even harder. Genuine efforts are being made at my university, and my department, and I believe at universities and departments around the world, to increase the number of women hired, and to treat them well. There is progress, but it seems to be slow and imperfect.

20.3 “Should I ask for more money?”

This might be the most common question I am asked by our graduating students who are considering job offers, in academia or not. Of course uncertainty about how to negotiate as one starts a career affects men and women, and there are many aspects to the dialogue between a candidate and prospective employer, including the hiring landscape at the time, the trade-off between salary and other aspects of the position, and so on.

When I arrived at the University of Toronto in 1986, both the government of Canada and the government of Ontario had passed pay equity laws, enshrining the principle of “equal pay for work of equal value.” This is broader than “equal pay for equal work,” which had already been in force for some years in most jurisdictions in Canada. The laws led to a flurry of work on pay equity at that time, and one of my first consulting projects, undertaken jointly with Ruth Croxford of our department, was to review the salaries of all faculty at the University of Toronto, with a view to identifying pay equity issues.

I was at the time unaware of a strong legacy for statistical analyses of faculty salaries initiated by Elizabeth Scott (Gray and Scott, 1980; Scott, 1975), who led the charge on this issue; see Billard and Ferber (1991). Lynne Billard made important follow-up contributions to the discussion in 1991 and 1994 (Billard, 1991, 1994). Lynne’s 1994 paper asked: “Twenty years later: Is there parity for academic women?”; her answer was “no” (the title referred to a US government law, Title IX of the Education Amendments, enacted in 1972). This continues to be the case. For example, the University of British Columbia (UBC) recommended in October 2012 an across-the-board salary increase of 2% to all female faculty (Boyd et al., 2012). The detailed review of university practices on this issue in the UBC report noted that many universities have implemented ongoing reviews and adjustments of female faculty salaries on a regular schedule.

As with hiring, it is easy to get distracted in the salaries debate by potential factors contributing to this. One is rank; it continues to be the case that women are under-represented in the professorial rank; the London Mathematical Society has just published a report highlighting this fact in mathematics (London Mathematical Society, 2013). A point very clearly explained in Gray and Scott (1980) is that systematic differences in rank are themselves a form of gender bias, so they are not an appropriate defence against salary remedies. Even setting this aside, however, the UBC report concluded that after adjusting for rank, departmental unit, merit pay and experience, “there remains an unexplained female disadvantage of about $3000” (Boyd et al., 2012).

It may be the case that on average, women are less aggressive in negotiating starting salaries and subsequent raises, although of course levels of skill in, and comfort with, negotiation vary widely across both genders. Other explanations, related to publication rate, lack of interest in promotion, time off for family matters, and so on, seem to need to be addressed in each equity exercise, although this seems to me to be once again “changing the subject.” As just one example in support of this view, the UBC report, referring to an earlier salary analysis, concluded: “our assumptions would be supported by a more complete analysis, and... parental leave does not alter the salary disadvantage.”

Over the years I have often met statistical colleagues, usually women, who were also asked by their university to consult on an analysis of female faculty salary data, and it often seemed that we were each re-inventing the wheel. The pioneering work of Gray and Scott (Gray and Scott, 1980) touches on all the main issues that are identified in the UBC report, and it would be good to have a central repository for the now quite large number of reports from individual universities, as well as some of these key references.

20.4 “I’m honored”

The Elizabeth L. Scott Award was established by COPSS in 1992 to honor individuals who have “helped foster opportunities in statistics for women.” The first winner was Florence Nightingale David, and the second, Donna Brogan, said in her acceptance speech that she looked forward to the day when such awards were no longer needed. While women are well-represented in this volume, I think that day is not here yet.

I was involved in a number of honor-related activities over the past year or two, including serving on the Program Committee for a major meeting, on an ad hoc committee of the Bernoulli Society to establish a new prize in statistics, and as Chair of the F.N. David Award Committee.

On the Program Committee, the first task for committee members was to create a long list of potential candidates for the plenary lecture sessions. In a hurry, as usual, I jotted down my personal list of usual suspects, i.e., people whose work I admire and who I thought would give interesting and important presentations. I examined my list more critically when I realized that my usual suspects were all about the same age (my age or older), and realized I’d better have another think. Some on my list had already given plenary lectures at the same meeting in earlier years, so off they came. I’m embarrassed to admit that it took me several passes before I realized I had no women on my list; another round of revisions was called for. At that point I did some research, and discovered that for this particular meeting, there had been no women plenary speakers since 1998, which seemed a pretty long time.

Then things got interesting. I sent an email to the Program Committee pointing this out, and suggesting that we should commit to having at least one female plenary lecturer. Email is the wrong medium in which to rationalize opposing views, and it turned out there were indeed opposing, as well as supporting, views of this proposal. Extremes ranged from “I do not consider gender, race or any other non-scientific characteristics to be relevant criteria” to “I do find it important for the field and for the meeting that female researchers are well represented.” Without the diplomatic efforts of the Chair, we might still be arguing.

What did I learn from this? First, we all have our biases, and it takes some effort to overcome them. The more well-known people are, the more likely they are to be suggested for honors, awards, plenary lectures, and so on. The older they are, the more likely they are to be well-known. Statistical science is aging, and we have a built-in bias in favor of established researchers that I think makes it difficult for young people to get the opportunities and recognition that I had when I was young(er). Second, our biases are unintentional, much as they surely were for the scientists evaluating lab manager applications. We are all busy, we have a lot of demands on our time, and the first, quick answer is rarely the best one. Third, it is important to have women on committees. I wish it were not so; I have served on far too many committees in my career, and every woman I know says the same thing. It turned out I was the only woman on this particular Program Committee, and while I had good support from many members, I found it lonely.

The Bernoulli Society recently established the Wolfgang Doeblin Prize in probability (see Bernoulli Society (2012)), and I chaired an ad hoc committee to consider a new prize in statistics. Wolfgang Doeblin died in 1940, just 25 years old, and shortly before his death wrote a manuscript later found to contain many important ideas of stochastic calculus (Göbel, 2008). The award is thus given to a single individual with outstanding work, and intended for researchers at the beginning of their mathematical career.


A parallel to this could be a prize for statisticians at the beginning of their career, and my personal bias was to restrict the award to women. On discussion with colleagues, friends, and members of the committee, it became apparent that this was not quite as good an idea as I had thought. In particular, it seemed likely that a prize for women only might have a negative connotation — “not a ‘real’ prize” — not among the curmudgeonly senior colleagues of the awardees, but among the potential awardees themselves. In fact I am sure I would have felt that way myself.

On to the next challenge. We decided to recommend naming the prize after a female statistician, no longer living. Well, we already have the E.L. Scott prize, the F.N. David Award, and the Gertrude Cox scholarship. Quick, how many can you name? How many of your colleagues will recognize the name? We discovered Ethel Newbold (1882–1933), the first woman to be awarded a Guy Medal in Silver from the Royal Statistical Society, in 1928. We also discovered that the second woman to be awarded a Guy Medal in Silver was Sylvia Richardson, in 2002. Get nominating, ladies! We needn’t feel smug on this side of the Atlantic, either; see Gray and Ghosh-Dastidar (2010) and Palta (2010).

The F.N. David Award is the only international award in statistical sciences that I am aware of that is restricted to women. It was established jointly by COPSS and the Caucus for Women in Statistics in 2001. The nominees this year were amazing, with nomination letters and vitae that could induce strong feelings of inadequacy in any reader. But a side remark from one nominator got me thinking. The nominator pointed out that the selection criteria for the award were daunting indeed, although his nominee did indeed fulfill all the criteria, and then some. I had a more careful look at these criteria, and they are

“Excellence in the following: as a role model to women; statistical research; leadership in multidisciplinary collaborative groups; statistics education; service to the profession.”

Hello Caucus! We are much too hard on each other! But perhaps I’m being unfair, and the intention was “one of,” rather than “all of.” I can say, though, that our leading female colleagues do seem to manage to excel in “all of.” Here’s Laba again on a similar point:

“The other way to make progress, of course, is for women to be ‘twice as good,’ [···] That’s what many women in science have been doing all along. It takes a toll on us. It’s not a good solution. Unfortunately, sometimes it’s the only one we’ve got.” (Laba, 2012b)


20.5 “I loved that photo”

The 1992 Joint Statistical Meetings coincided with our first daughter’s first birthday, so we joined the legions of families who combine meetings with children. If you’ve done this you’ll know how hard it can be, and if you haven’t, well, be assured that it is an adventure. George Styan snapped a picture of me with Ailie in a back carrier, and this photo ended up being printed in the IMS Bulletin. At the time I was quite embarrassed by this — I thought that this photo would suggest that I wasn’t taking the meeting, and by extension research, seriously enough. But in fact I received so many emails and notes and comments from colleagues, expressing the sentiment in the section heading, that in the end I was, and am, grateful to George for his sixth sense for a good snapshot.

For me the most difficult aspect of the discussion around women and academia is children. Decisions around having and raising children are so deeply personal, cultural, and emotional that it often seems better to leave this genie in the bottle. It is also the most disruptive part of an academic career, and by and large still seems to be more disruptive for women than for men. It risks being a two-edged sword. If differences in opportunities are tied to child-rearing, then is there a temptation to assume that women without children face no hurdles, or that men who choose to become more involved in child care should be prepared to sacrifice advancement at work? Again it is easy to get distracted by the complexity and depth of the issues, and lose the main thread.

The main thread, to me, is the number of women who ask me whether or not it is possible to have an academic career in a good department and still have time for your family. When is the ‘best’ time to have children — grad school? Post-doc? Pre-tenure? If I wait until I have tenure, will I have waited too long? What about promotion to full professor? I don’t know the answer to any of these questions, but I do know women who have had children at each of these stages, and who have had, and are having, very successful academic careers.

I can only speak for academia, but exceptionally talented and successful women are speaking about government and industry: Anne-Marie Slaughter (Slaughter, 2012) and Sheryl Sandberg (Sandberg, 2013), to name just two. Slaughter, a Princeton professor who spent two years in a high profile position in Washington, DC, writes:

“I still strongly believe that women can ‘have it all’ (and that men can too). I believe that we can ‘have it all at the same time.’ But not today, not with the way America’s economy and society are currently structured. My experiences over the past three years have forced me to confront a number of uncomfortable facts that need to be widely acknowledged — and quickly changed.” (Slaughter, 2012)


Happily for most readers of this article, Slaughter contrasts the flexibility and freedom she has with her academic appointment with the demands of a high-profile position in government. While the very heavy demands on her time at Princeton would make most of us weep, “I had the ability to set my own schedule most of the time. I could be with my kids when I needed to be, and still get the work done” (Slaughter, 2012). This article is a great reference to provide to your graduate students, if they start wondering whether an academic career is too difficult to combine with family life. However, much of what she describes as barriers to women in high-level government positions resonates in a thoughtful analysis of the difficulties faced by women moving into the leadership ranks of universities; see Dominici et al. (2009).

Sandberg has been criticized for implying that at least some of the barriers for advancement of women are created by the women’s own attitudes, particularly around family, and she exhorts women to make sure that they are aggressively pursuing opportunities. The position set out in her book (Sandberg, 2013) is much more nuanced than that, but the notion that women are sometimes their own worst enemies did resonate with me. In an interesting radio interview with the BBC (BBC News, 2013), Sandberg suggested that the phrases “work-life balance” and “having it all” should de facto be mistrusted, as they are themselves quite gender-specific.

Recently I visited the lovely new building that is now home to the Department of Statistics at North Carolina State University. As it happened, there was a career mentoring event taking place in the department at the same time, and over coffee I met an enthusiastic young woman who is completing a PhD in statistics. Her first question was about balancing career and family in a tenure-stream position; I think I relied on the rather bland “advice” mentioned above. But the most encouraging part of the day was the tour of the department. There, between the faculty offices, department lounge, seminar rooms, and banks of computer terminals, was a wonderful sight: a Baby Room! I imagine that Gertrude Cox would be surprised to see this, but I hope she would also be proud of the department she founded.

20.6 Conclusion

The position of women in academia is vastly improved from the days when Elizabeth Scott was discouraged from studying astronomy, and from the days when my probability professor could state in class that “women are not suited for mathematics.” Determined and forceful pioneers through the 1950s and 1960s, followed by much larger numbers of female students from the 1970s on, have meant that women do have many opportunities to succeed in academic work, and many are succeeding on a number of levels.


I find it difficult to discuss all these issues without seeming plaintive. I write from a privileged position, and I can say without hesitation that I personally do not feel disadvantaged; my career has coincided with a concerted effort to hire and promote women in academia. And yet I’ve felt the energy drain from trying to tackle some of the issues described here. I’ve experienced the well-documented attrition through the ranks: although my undergraduate statistics class had nine women in a class of 23, and my graduating PhD class had four women and three men, I continue to be the only female research-stream faculty member in my department. While I enjoy my colleagues and I love my job, I believe this stark imbalance means our department is missing out on something intangible and important.

So while I don’t stress about gender issues all the time, I do find that after all these years there still are many things to discuss, to ponder, to wonder over, and with luck and determination, to solve.

Acknowledgements

I was fortunate to meet as an undergraduate three (then future) leaders of our discipline: Lynne Billard, Jane Gentleman, and Mary Thompson; this is my personal proof that role models are essential. I would like to thank them for their contributions to, and achievements for, our profession.

References

BBC News (2013). Powerful women are less liked, April. http://www.bbc.co.uk/news/business-22189754. Accessed on February 11, 2014.

Bernoulli Society (2012). Wolfgang Doeblin Prize, 2012. http://www.bernoulli-society.org/index.php/prizes/158. Accessed on February 11, 2014.

Billard, L. (1991). The past, present, and future of academic women in the mathematical sciences. Notices of the American Mathematical Society, 38:714–717. http://www.awm-math.org/articles/notices/199107/billard/index.html. Accessed on February 11, 2014.

Billard, L. (1994). Twenty years later: Is there parity for women in academia? Thought and Action, 10:114–144. http://www.stat.uga.edu/stat_files/billard/20yearslater.pdf. Accessed on February 11, 2014.


Billard, L. and Ferber, M.A. (1991). Elizabeth Scott: Scholar, teacher, administrator. Statistical Science, 6:206–216.

Boyd, L., Creese, G., Rubuliak, D., Trowell, M., and Young, C. (2012). Report of the gender pay equity recommendation committee. http://www.facultyassociation.ubc.ca/docs/news/GenderPayEquity_JointCommunique.pdf. Accessed on February 11, 2014.

Cox, D.R., Gleser, L., Perlman, M.D., Reid, N.M., and Roeder, K. (1993). Report of the ad hoc committee on double-blind refereeing. Statistical Science, 8:310–317.

Dominici, F., Fried, L.P., and Zeger, S. (2009). So few women leaders. Academe, 95:25–27.

Göbel, S. (2008). The mathematician Wolfgang Doeblin (1915–1940) — searching the internet. In A Focus on Mathematics (Prof. Dr. Bernd Wegner and Staff Unit Communications, Eds.), pp. 31–34. Fachinformationszentrum Karlsruhe, Karlsruhe, Germany. http://www.zentralblatt-math.org/year-of-mathematics/. Accessed on February 11, 2014.

Goldberg, P.A. (1968). Are women prejudiced against women? Transactions, 5:28–30.

Gray, M.W. and Ghosh-Dastidar, B. (2010). Awards for women fall short. AmStat News, October. http://magazine.amstat.org/blog/2010/10/01/awardswomenfallshort/. Accessed on February 11, 2014.

Gray, M.W. and Scott, E.L. (1980). A “statistical” remedy for statistically identified discrimination. Academe, 66:174–181.

Laba, I. (2012a). Biased. http://ilaba.wordpress.com/2012/09/25/biased/. Accessed on February 11, 2014.

Laba, I. (2012b). The perils of changing the subject. http://ilaba.wordpress.com/2012/10/02/the-perils-of-changing-the-subject/. Accessed on February 11, 2014.

Laba, I. (2013). Gender bias 101 for mathematicians. http://ilaba.wordpress.com/2013/02/09/gender-bias-101-for-mathematicians/. Accessed on February 11, 2014.

Little, L. (2012). Women studying science face gender bias, study finds. http://abcnews.go.com/blogs/business/2012/09/women-studying-science-face-gender-bias-study-finds/. Accessed on February 11, 2014.


London Mathematical Society (2013). Advancing Women in Mathematics: Good Practice in UK University Departments. Technical report, London Mathematical Society. Prepared by Sean MacWhinnie and Carolyn Fox, Oxford Research and Policy.

Moss-Racusin, C.A., Dovidio, J.F., Brescoll, V.L., Graham, M.J., and Handelsman, J. (2012a). Science faculty’s subtle gender biases favor male students. Proceedings of the National Academy of Sciences, 109:16474–16479.

Moss-Racusin, C.A., Dovidio, J.F., Brescoll, V.L., Graham, M.J., and Handelsman, J. (2012b). Supporting information: Moss-Racusin et al. 10.1073/pnas.1211286109, 2012. www.pnas.org/cgi/content/short/1211286109. Accessed on February 11, 2014.

Palta, M. (2010). Women in science still overlooked. AmStat News, October. http://magazine.amstat.org/blog/2010/10/01/women-in-science/. Accessed on February 11, 2014.

Sandberg, S. (2013). Lean In: Women, Work and the Will to Lead. Random House, New York.

Scott, E.L. (1975). Developing criteria and measures of equal opportunity for women. In Women in Academia: Evolving Policies Towards Equal Opportunities (A. Lewin, E. Wasserman, and L. Bleiweiss, Eds.). Praeger, New York, pp. 82–114.

Slaughter, A.M. (2012). Why women still can’t have it all. Atlantic, 310:85–102.


21
Reflections on diversity

Louise M. Ryan
School of Mathematical Sciences
University of Technology, Sydney, Australia

21.1 Introduction

I recall quite vividly the start in the early 1990s of my interest in fostering diversity in higher education. Professor James Ware had just been appointed as the Academic Dean at Harvard School of Public Health and was letting go of some of his departmental responsibilities. He asked if I would take over as director of the department’s training grant in environmental statistics, funded through the National Institute of Environmental Health Sciences (NIEHS). Being an ambitious young associate professor, I eagerly accepted. It wasn’t long before I had to start preparing the grant’s competitive renewal. These were the days when funding agencies were becoming increasingly proactive in terms of pushing universities on the issue of diversity, and one of the required sections in the renewal concerned minority recruitment and retention. Not knowing much about this, I went for advice to the associate dean for student affairs, a bright and articulate African American woman named Renee (not her true name). When I asked her what the School was doing to foster diversity, she chuckled and said “not much!” She suggested that I let her know when I was traveling to another city and she would arrange for me to visit some colleges with high minority enrollments so that I could engage with students and teachers to tell them about opportunities for training in biostatistics at Harvard.

Not long after this, I received an invitation to speak at the University of Mississippi in Oxford, Mississippi. I excitedly called Renee to tell her about my invitation, naively commenting that since I would be visiting a university in the South, there must be lots of minority students there with whom I could talk about opportunities in biostatistics. She laughed and said “Louise, it’s a bit more complicated than that...” She went on to tell me about some of the history associated with “Ole Miss,” including the riots in the early 60s triggered by the brave efforts of African American James Meredith to enroll as a student — for a fascinating, “can’t put the book down” read, try Nadine Cohodas’s “The Band Played Dixie” (Cohodas, 1997). Renee encouraged me to accept the invitation, but also offered to arrange visits to a couple of other schools within driving distance of Oxford that had high enrollments of minority students. Rust College and Lemoyne-Owen College were both members of the network of Historically Black Colleges and Universities (HBCU). This network comprises 105 schools that were originally established in the days of segregation to educate black students, but which continue today, proud of their rich heritage and passionate about educating African American students and preparing them to become tomorrow’s leaders. While students of any race or ethnicity can apply for admission to a HBCU, the majority of students are of African American heritage. Some HBCUs, especially the more well-known ones such as Howard University in Washington, DC, and Spelman College in Atlanta, are well endowed and have the same atmosphere of privilege and learning that one finds on so many modern liberal arts college campuses. Others, while unquestionably providing a sound college education, were not so wealthy. Rust and Lemoyne-Owen Colleges were definitely in the latter category. My visit to those two colleges felt to be in stark contrast to the sense of wealth and privilege that I encountered at Ole Miss. The experience for me was a major eye-opener, and I came away with a sense of determination to do something to open up an avenue for more minority students to pursue graduate work in biostatistics.

21.2 Initiatives for minority students

Serendipitously, the NIEHS had just announced the availability of supplementary funds for universities with existing training grants to establish summer programs for minority students. We successfully applied, and the next summer (1992) ran our first ever Summer Program in Biostatistics, with six math majors from various HBCUs, including one student from Lemoyne-Owen. The 4-week program comprised an introductory course in biostatistics, along with a series of faculty seminars designed to expose students to the breadth of interesting applications in which biostatisticians engage. We also organized practical sessions focussed on things such as how to prepare for the Graduate Record Examination (GRE) and tips on applying for graduate school. We built in lots of time for the summer students to meet more informally with students and faculty from our department. Finally, we organized various social activities and outings in Boston, always involving department students and faculty. Our goal was to create an immersive experience, with a view to giving the students a taste of what a graduate experience might be, and especially demystifying the Harvard experience. I recall very clearly one of our earlier participants saying that without having attended the Program, she would never have even considered applying for graduate studies, particularly at a place like Harvard. This particular student did apply to Harvard, was accepted and went on to be one of the strongest students in her class. She is now a successful faculty member at a major university near Washington, DC, and has published her work in top statistical journals. This student’s story is just one of many similar ones and represents a measurable successful outcome of the Program. After a few years of running the Summer Program, we succeeded in winning a so-called IMSD (Initiative for Minority Student Development) grant available at that time from the National Institute of General Medical Sciences. This grant was much larger, supporting not only an expansion of our Summer Program, but also providing funds for doctoral and postdoctoral training and expanding beyond biostatistics into other departments.

The IMSD grant had a major impact on the Department and the School as a whole. It strengthened the legitimacy of our diversity efforts by generating substantial funds and also by influencing the nature of the research that many of us were doing. The IMSD grant required us to develop a strong research theme, and we had chosen to look at the development and application of quantitative methods for community-based research, with a strong emphasis on understanding and reducing health disparities. While it would be an inappropriate generalization to expect that every minority student will be interested in the study of health disparities, the reality was that many of our students were. I’ll come back to this point presently, but I believe an important element of academic success is giving students the opportunity to pursue research in areas that ignite their passion. Embracing diversity will inevitably involve being exposed to new ideas and perspectives, and this was just one example of how that played out in the department. We ran a weekly seminar/discussion group that provided an opportunity not only to have formal seminars on health disparities research, but also to create a supportive community where the students could talk about the various issues, academic and other, that they were encountering.

21.3 Impact of the diversity programs

Our diversity programs had profound impacts over time. I think it is fair to say that the students weren’t exposed to overtly racist attitudes, certainly not of the extreme kind described in Nadine Cohodas’s book. However, they were most definitely affected by many subtle aspects, especially in the early days of the Program. Examples included faculty expectations of lowered performance or resentment from fellow students at a perception of special treatment. By making such observations, I am not trying to criticize or cast judgment, or even excluding myself from having stereotyped views. As discussed by Malcolm Gladwell in his excellent book entitled “Outliers” (Gladwell, 2011), virtually all of us do, and this can have strong and negative effects. Gladwell discusses extensively the impact, positive and negative, of the social environment on academic performance and general success in life. He also describes some very interesting experiments designed to measure very subtle aspects of negative racial stereotyping. Recognizing our own tendency to stereotype others is in fact an important first step towards making progress towards a more equitable society. Working closely with the students over so many years provided an opportunity for me to observe and, to at least some extent, empathize with the challenges of being a minority student, even in today’s relatively enlightened educational system. Internal and external expectations of underperformance very easily turn into reality. Self-doubt can erode confidence, leading students to isolate themselves, thus cutting themselves off from the beneficial effects of being in student study groups. On the flip side, however, we saw the positive and reinforcing effects of growing numbers and student success stories. I will never forget the shock we all experienced one year when a particularly bright young African American man failed the department’s doctoral qualifying exam. To his credit, he dusted himself off and developed a steely determination to succeed the following year. He did so with flying colors, winning the departmental award for the top score in the exam (an award that is assigned purely on the basis of exam performance and blind to student identity). That same year, another African American, a young woman, also failed the exam. Although devastated, she was also determined to not only try again, but to repeat the outstanding performance of her classmate and win the prize. And she did. I still get cold shivers thinking about it! These were the kinds of things, along with growing critical mass, that got things changing. It is quite awe-inspiring to think about what some of our program graduates are doing today and how, through their success, they are inspiring and encouraging the next generation to thrive as well.

I don’t feel like I have the language or skill to describe many of the profound things that I learned and experienced through directing the minority program at Harvard for so many years. However, I recently read an excellent book, “Whistling Vivaldi” (Steele, 2011), by someone who does — Claude Steele, a renowned social scientist and Dean of the Graduate School of Education at Stanford. Much of Steele’s work has been on the concept of stereotype threat. The idea is that when a person is being evaluated (e.g., through a test), their performance can be significantly undermined if they believe that the evaluators will be looking at them through the lens of a stereotype. While stereotyping happens anytime there are broad-based characterizations of a person’s ability or character, based on their social standing, ethnicity or race, the ones that most readily come to mind in the educational context are gender and math/science ability as well as race and general academic performance. Steele describes some fascinating experiments where test scores can be significantly impacted according to whether or not subjects are conscious of stereotype threat. Not only a great read, “Whistling Vivaldi” is a definite eye-opener.


21.4 Gender issues

I’ve thought a lot over the years about the issue of gender in our field of statistics. During my almost thirty years in US academia, I was never particularly conscious of experiencing any obvious bias or discrimination because I was a woman. I was even beginning to think that the days were over where special efforts were still needed to encourage and support women in science. In fact, I even felt slightly guilty when I received the COPSS award that recognizes Elizabeth L. Scott’s lifelong efforts in the furtherance of the careers of women. Since returning to Australia in early 2009, however, my thinking has changed a bit. I’ve found the research environment here much harder to navigate than in the US and my confidence has suffered as a result. At a meeting of the Australian Academy of Science earlier this year, I had something of a lightbulb moment talking with Terry Speed and several others who assured me that the problem wasn’t just me, but rather I was experiencing the impact of working in an environment that was inherently more difficult for women than for men. A telling symptom of this was that none of the 20 new fellows elected to the Australian Academy of Science in 2013 were women! While this situation was something of an embarrassment to the Academy, it did provide an important opportunity for collective self-reflection and dialogue on the issue of gender diversity in Australian science.

I realized that I was struggling with some of the same challenges that I had worked so hard years earlier to help my students overcome. Because there were fewer successful academic women in Australia, I felt more isolated. Also, because the guidelines for assessing success reflected a more male perspective, I was not measuring up so well. For example, because of some family responsibilities, I was generally reluctant to accept many invitations to speak at international conferences. However, such activities were seen as very important when it came to evidence of track record for grant applications. Finally, my interests didn’t quite align. In the US, I had been very fortunate to spend my career in an environment that embraced interdisciplinary research and where the model of a biostatistician combining collaborative and methodological research was not only well understood, but seen as an ideal. In Australia, the model was a more traditional one of a successful, independent academic heading up a team of students, postdocs and junior staff. For me, this model just didn’t fit. For all these reasons, it made sense that I was having some difficulty in finding my place within the Australian academic environment. But instead of seeing this bigger picture, I was personalizing it and starting to believe that I simply didn’t have the talent to succeed. I see now that I have an opportunity to put into practice some of the advice I used to give my students about believing in myself, keeping in mind the bigger picture and understanding that by persevering, I can help change the system. My experience also underscores why having a diverse workforce helps the whole system to be healthier and more effective. A diverse workforce means a diversity of opinions and values, and a diversity of approaches to problem solving. Diversity broadens the scope of what’s important, how workplaces are organized and how people are valued. In the end, a diverse workplace is good for everyone.

References

Cohodas, N. (1997). The Band Played Dixie: Race and the Liberal Conscience at Ole Miss. Free Press.

Gladwell, M. (2011). Outliers: The Story of Success. Back Bay Books, New York.

Steele, C.M. (2011). Whistling Vivaldi: How Stereotypes Affect Us and What We Can Do. W.W. Norton & Co, New York.


Part IV
Reflections on the discipline


22
Why does statistics have two theories?

Donald A.S. Fraser
Department of Statistical Sciences
University of Toronto, Toronto, ON

The public image of statistics is changing, and recently the changes have been mostly for the better, as we’ve all seen. But occasional court cases, a few conspicuous failures, and even appeals to personal feelings suggest that careful thought may be in order. Actually, statistics itself has more than one theory, and these approaches can give contradictory answers, with the discipline largely indifferent. Saying “we are just exploring!” or appealing to mysticism can’t really be appropriate, no matter the spin. In this paper for the COPSS 50th Anniversary Volume, I would like to examine three current approaches to central theory. As we will see, if continuity that is present in the model is also required for the methods, then the conflicts and contradictions resolve.

22.1 Introduction

L’Aquila and 300 deaths. The earthquake at L’Aquila, Italy on April 5, 2009 had been preceded by many small shocks, and Italy’s Civil Protection Department established a committee of seismologists to address the risks of a major earthquake. The committee reported before the event that there was no particularly good reason to think that a major earthquake was coming, and the Department’s Deputy Head even allowed that the small shocks were reducing the seismic stresses, lowering the chances of a major quake. This gave some reassurance to many who were concerned for their lives; but the earthquake did come and more than 300 died. For some details, see Pielke (2011). Charges were then brought against the seismologists and seven were sentenced to six years in prison for manslaughter, “for falsely reassuring the inhabitants of L’Aquila.” Part of the committee’s role had been the communication of their findings, statistics being intrinsically involved. See Marshall (2012) and Prats (2012).


Vioxx and 40,000 deaths. The pain killer Vioxx was approved by the US Food and Drug Administration (FDA) in 1999 after a relatively short eight years in the approval process, and then withdrawn by the pharmaceutical company Merck in 2004 after an acknowledged excess of cardiovascular thrombotic (CVT) events under Vioxx in a placebo-controlled study. But statistical assessments as early as 2000 had indicated the heightened rate of CVT events with the use of Vioxx. Statistician David Madigan of Columbia University rose to the challenge as litigation consultant against Merck, and a five billion dollar penalty against Merck went to the injured and their survivors; some felt this was a bargain for Merck, as the company had made billions in profit from the drug. One estimate from the FDA of the number of deaths attributed to the use of the drug was 40,000. See Abraham (2009).

Challenger and 7 deaths. The space shuttle Challenger had completed nine successful flights, but on its tenth take-off on January 28, 1986 it disintegrated within the first two minutes. The failure was attributed to the breakdown of an O-ring on a solid rocket booster. The external temperature before the flight was well below the acknowledged tolerance for the O-rings, but the flight was given the go-ahead. The 7 crew members died. See Dalal and Fowlkes (1989) and Bergin (2007).

The preceding events involve data, data analysis, determinations, predictions, presentations, then catastrophic results. Where does responsibility fall? With the various levels of the application of statistics? Or with the statistical discipline itself, with its contradicting theories? Or with the attitude of many statisticians, that we are just exploring and believe in the tools we use?

Certainly the discipline of statistics has more than one theory, and these can give contradictory results: witness frequency-based, Bayes-based, and bootstrap-based methodology; these provide a wealth of choice among the contraindicating methods. Here I would like to briefly overview the multiple theories with a view to showing that if continuity as present in the typical model is also required for the methods, an equivalence emerges among the frequency, the bootstrap, and partially the Bayesian approach to inference.

But also, there is an attitude within the discipline that tolerates the contradictions and indeed affects within-discipline valuations of statistics and statisticians. In recent years, an important Canadian grant adjudication process had mathematicians and statisticians evaluating applications from mathematicians and statisticians using standardized criteria, but with a panel from mathematics for the mathematicians and a panel from statistics for the statisticians; and it was found that mathematicians rate mathematicians much higher than statisticians rate statisticians, even though it was clear that benefits would be apportioned accordingly. For details, see Léger (2013). The contradictory theory and the contradictory attitude provide a potential for serious challenges for statistics, hopefully not at the level of L’Aquila, Vioxx and Challenger.
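The frequency–Bayes disconnect described above can be made concrete with a small numerical sketch, which is not part of the original chapter and uses purely illustrative numbers (3 successes in 20 binomial trials): for the same data, a frequency (Clopper–Pearson) interval and a flat-prior Bayes credible interval for the success probability do not agree.

# Minimal sketch (illustrative, not from the chapter): two theories, two answers
# for the same binomial data.
from scipy.stats import beta

x, n, alpha = 3, 20, 0.05   # 3 successes in 20 trials; hypothetical numbers

# Frequency answer: Clopper-Pearson "exact" 95% confidence interval via beta quantiles
cp_lower = beta.ppf(alpha / 2, x, n - x + 1) if x > 0 else 0.0
cp_upper = beta.ppf(1 - alpha / 2, x + 1, n - x) if x < n else 1.0

# Bayes answer: equal-tailed 95% credible interval; a flat prior gives a
# Beta(x + 1, n - x + 1) posterior for the success probability
b_lower = beta.ppf(alpha / 2, x + 1, n - x + 1)
b_upper = beta.ppf(1 - alpha / 2, x + 1, n - x + 1)

print(f"Clopper-Pearson 95% interval:  ({cp_lower:.3f}, {cp_upper:.3f})")
print(f"Flat-prior Bayes 95% interval: ({b_lower:.3f}, {b_upper:.3f})")
# The intervals differ; each is "correct" within its own theory, which is the
# kind of contradiction this chapter is concerned with.

A Jeffreys prior, or a different choice of credible interval, would give yet another answer; the point of the sketch is only that the answers need not coincide.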


22.2 65 years and what’s new

I did my undergraduate studies in mathematics in my home town of Toronto, Ontario. An opportunity to study analysis and algebra in the doctoral program at Princeton University arose in 1947. But then, with a side interest in actuarial things, I soon drifted to the Statistics Group led by Sam Wilks and John Tukey. A prominent theme was Neyman–Pearson theory, but a persistent seminar interest focussed on Fisher’s writings, particularly those on fiducial inference, which had in turn triggered the Neyman (Neyman, 1937) confidence methodology. But also, a paper by Jeffreys (Jeffreys, 1946) kept reemerging in discussions; it offered a default Bayes (Bayes, 1763) approach, often but incorrectly called objective Bayes in present Bayes usage. The striking thing for me at that time was the presence of two theories for statistics that gave contradictory results: if the results were contradictory, then simple logic on theories says that one or the other, or both, are wrong. This latter view, however, was not part of the professional milieu at the time, though there was some puzzlement and vague acceptance of contradictions as being in the nature of things; and this may even be part of current thinking! “One or the other, or both, could be wrong?” Physics manages to elicit billions in taxpayer money to assess its theories! Where does statistics stand?

With a completed thesis that avoided the frequency-Bayes contradictions, I returned to Canada and accepted a faculty position in the Department of Mathematics at the University of Toronto. The interest in the frequency-Bayes contradictions, however, remained, and a conference talk in 1959 and two resulting papers (Fraser, 1961a,b) explored a broad class of statistical models for which the two approaches gave equivalent results: the location model f(y − θ), of course, and the locally-generated group extensions, the transformation-parameter models. Then an opportunity for a senior faculty position in the Mathematics Department at Princeton arose in 1963, but I was unable to accept. The concerns for the frequency-Bayes contradictions, however, remained!

Now in 2013, with COPSS celebrating its 50th anniversary, we can look about and say “What’s new?” And even more, we are encouraged to reminisce! There is very active frequency statistics and related data analysis; and there is very active Bayesian statistics; and they still give contradictory answers. So nothing has changed on the frequency-Bayes disconnect: what goes around comes around... Does that apply to statistical theory in the 65 years I have been in the profession? Oh, of course, there have been massive extensions to data exploration, to computer implementation, to simulations, and to algorithmic approaches. Certainly we have Precision, when sought! But what about Accuracy? I mean Accuracy beyond Precision? And what about the frequency-Bayes contradictions in the theory? And even, indeed, the fact that no one seems to care? And then L’Aquila, Vioxx, Challenger, and of course the contradictory theory? Are perceptions being suppressed? It might wind up in a court, as with L’Aquila, an inappropriate place to address a scientific issue but perhaps not to address a conflict coming from discipline contradictions.

Well yes, something has changed! Now a general feeling in the milieu is acceptance of the frequency-Bayes contradiction: it just doesn’t matter, we are just exploring; our models and calculations are just approximations; and we can acquire any Precision we want, even though we may not have used the full information provided by the model, so just run the MCMC longer, even though several million cycles only give two decimal places for some wanted probability or confidence calculation. Or put together an algorithm for processing numbers. Or use personal feelings as in some Bayes methods.

But even for explorations it certainly behooves one to have calibrated tools! And more generally to know, with Precision and Accuracy, what a model with data implies; and to know this as a separate issue, quite apart from the descriptive Accuracy of the model in a particular context, which of course is in itself an important but separate issue! This Accuracy is rarely addressed! Indeed, as L’Aquila, Vioxx, and Challenger indicate, a concern for Accuracy in the end products of statistics may have an elusive presence in many professional endeavours. An indictment of statistics?

22.3 Where do the probabilities come from?

(i) The starting point. The statistical model f(y; θ) with data y^0 forms the starting point for the Bayes and often the frequency approach. The Bayesian approach calculates and typically uses just the observed likelihood L^0(θ) = f(y^0; θ), omitting other model information as part of a Bayes commitment. The frequency approach uses more than the observed likelihood function: it can use distribution functions and full model calculations, sometimes component model calculations that provide relevant precision information, and more.

(ii) The ingredients for inference. In the model-data context, y^0 is an observed value and is thus a known constant, and θ is an unknown constant. And if a distribution π(θ) is present, assumed, proposed or created, as the source for θ, then a second distribution is on offer concerning the unknown constant. Probabilities are then sought for the unknown constant, in the context of one or two distributional sources: one part of the given, and the other objective, subjective, or appended for computational or other reasons. Should these distributions be combined, or be examined separately, or should the added distribution be ignored? No over-riding principle says that distributions of different status or quality should be combined rather than having their consequences judged separately! Familiar Bayes methodology, however, takes the combining as a given, just as the use of only the observed likelihood function is taken as a given, essentially axioms in the Bayes methodology! For a recent discussion, see Fraser (2011).

(iii) The simple location model. Consider the location model f(y − θ). This is of course rather special in that the error, the variable minus the parameter, has a fixed known distributional shape, free of the parameter. A common added or proposed prior is the flat prior π(θ) = 1, representing the translation invariance of the model. As it stands, the model almost seems too simple for consideration here; but the reality is that this simple model exists as an embedded approximation in an incredibly broad class of models where continuity of parameter effect is present and should thus have its influence acknowledged.

(iv) Location case: p-value or s-value. The generic version of the p-value from observed data y^0 is
\[
p^0(\theta) = \int_{-\infty}^{y^0} f(y - \theta)\, dy = F^0(\theta),
\]
which records just the statistical position of the data relative to the parameter. As such it is just the observed distribution function. This p(θ) function is uniform on the interval (0, 1), which in turn implies that any related confidence bound or confidence interval has validity in the sense that it bounds or embraces the true parameter value with the stated reliability; see Fisher (1930) and Neyman (1937). In parallel, the observed Bayes survivor value is
\[
s^0(\theta) = \int_{\theta}^{\infty} f(y^0 - \alpha)\, d\alpha.
\]
The two different directions of integration correspond to data left of the parameter and parameter right of the data, at least in this stochastically increasing case. The two integrals are mathematically equal, as is seen from a routine calculus change of variable in the integration. Thus the Bayes survivor s-value acquires validity here: validity in the sense that it is uniformly distributed on (0, 1), and validity also in the sense that a Bayes quantile at a level β will have the confidence property and bound the parameter at the stated level. This validity depends entirely on the equivalence of the integrals, and no reference or appeal to conditional probability is involved or invoked. Thus in this location model context, a sample space integration can routinely be replaced by a parameter space integration, a pure calculus formality. And thus in the location model context there is no frequency-Bayes contradiction, just the matter of choosing the prior that yields the translation property, which in turn enables the integration change of variable and thus the transfer of the integration from sample space to parameter space.

(v) The simple scalar model. Now consider a stochastically increasing scalar model f(y; θ) with distribution function F(y; θ) and some minimum continuity and regularity. The observed p-value is
\[
p^0(\theta) = F^0(\theta) = \int_{-\infty}^{y^0} F_y(y; \theta)\, dy = -\int_{\theta}^{\infty} F_\theta(y^0; \alpha)\, d\alpha,
\]
where the subscripts to F denote partial differentiation with respect to the indicated argument. Each of the integrals records an F(y, θ) value as an integral of its derivative — the fundamental theorem of calculus — one with respect to θ and the other with respect to y. This is pure computation, entirely without Bayes! And then, quite separately, the Bayes survivor value using a proffered prior π(θ) is
\[
s^0(\theta) = \int_{\theta}^{\infty} \pi(\alpha)\, F_y(y^0; \alpha)\, d\alpha.
\]

(vi) Validity of Bayes posterior: Simple scalar model. The second integral for p^0(θ) and the integral for s^0(θ) are equal if and only if the integrands are equal, in other words if and only if
\[
\pi(\theta) = -\frac{F_\theta(y^0; \theta)}{F_y(y^0; \theta)} = \frac{\partial y(\theta; u)}{\partial \theta}\bigg|_{\text{fixed } F(y;\theta);\; y^0},
\]
with an appropriate norming constant included. The second equality comes from the total derivative of u = F(y; θ) set equal to 0, thus determining how a θ-change affects y for fixed probability position. We can also view v(θ) = ∂y(θ; u)/∂θ for fixed u as being the change in y caused by a change in θ, thus giving at y^0 a differential version of the y, θ analysis in the preceding subsection.

Again, with this simple scalar model analysis, there is no frequency-Bayes contradiction; it is just a matter of getting the prior right. The correct prior does depend on the data point y^0, but this should cause no concern. If the objective of Bayesian analysis is to extract all accessible information from an observed likelihood, and if this then requires the tailoring of the prior to the particular data, then this is in accord with that objective. Data-dependent priors have been around for a long time; see, e.g., Box and Cox (1964). But of course this data dependence does conflict with a conventional Bayes view that a prior should be available for each model type. The realities of data analysis may not be as simple as Bayes might wish.

(vii) What’s the conclusion? With a location model, Bayes and frequency approaches are in full agreement: Bayes gets it right because the Bayes calculation is just a frequency confidence calculation in mild disguise. However, with a non-location model, the Bayes claim with a percentage attached to an interval does require a data-dependent prior. But to reference the conditional probability lemma, relabeled as Bayes lemma, requires that a missing ingredient for the lemma be created, that a density not from the reality being investigated be given objective status in order to nominally validate the term probability: this violates mathematics and science.
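The equivalence claimed in (iv) is easy to check numerically; the following sketch (not part of the original text, with an illustrative observed value y^0 = 1.3 and a standard Normal error density standing in for a generic location model) compares the observed p-value with the flat-prior Bayes survivor value over a few θ values.

# Numerical check of (iv): for a location model with flat prior, the observed
# p-value and the Bayes survivor value agree as functions of theta.
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

y0 = 1.3           # observed data point (illustrative value)
f = norm.pdf       # standard Normal error density as a concrete location model

def p_value(theta):
    # p0(theta) = integral of f(y - theta) over y < y0, i.e. F(y0 - theta)
    return norm.cdf(y0 - theta)

def bayes_survivor(theta):
    # s0(theta) = integral over alpha > theta of f(y0 - alpha) d alpha,
    # the flat-prior posterior probability that the parameter exceeds theta
    val, _ = quad(lambda alpha: f(y0 - alpha), theta, np.inf)
    return val

for theta in (-1.0, 0.0, 0.5, 2.0):
    print(f"theta={theta:5.2f}  p-value={p_value(theta):.6f}  "
          f"Bayes survivor={bayes_survivor(theta):.6f}")
# The two columns agree up to quadrature error, reflecting the change of
# variable that transfers the integration from sample space to parameter space.

For a non-location model the two columns would differ unless the data-dependent prior of (vi) were used in place of the flat prior.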


D.A.S. Fraser 24322.4 Inference for regular models: Frequency(i) Normal, exponential, and regular models. Much of contemporary inferencetheory is organized around Normal statistical models with side concernsfor departures from Normality, thus neglecting more general structures. Recentlikelihood methods show, however, that statistical inference is easy anddirect for exponential models and more generally for regular models using anappropriate exponential approximation. Accordingly, let us briefly overviewinference for exponential models.(ii) Exponential statistical model. The exponential family of models iswidely useful both for model building and for model-data analysis. The full exponentialmodel with canonical parameter ϕ and canonical variable u(y) bothof dimension p is f(y; ϕ) =exp{ϕ ′ u(y)+k(ϕ)}h(y). Let y 0 with u 0 = u(y 0 )be observed data for which statistical inference is wanted. For most purposeswe can work with the model in terms of the canonical statistic u:g(u; ϕ) =exp{l 0 (ϕ)+(ϕ − ˆϕ 0 ) ′ (u − u 0 )}g(u),where l 0 (ϕ) =a +lnf(y 0 ; ϕ) istheobservedlog-likelihoodfunctionwiththeusual arbitrary constant chosen conveniently to subtract the maximum loglikelihoodln f(y 0 ;ˆϕ 0 ), using ˆϕ 0 as the observed maximum likelihood value.This representative l 0 (ϕ) has value 0 at ˆϕ 0 , and −l 0 (ϕ) relative to ˆϕ 0 is thecumulant generating function of u − u 0 , and g(u) isaprobabilitydensityfunction. The saddle point then gives a third-order inversion of the cumulantgenerating function −l 0 (ϕ) leadingtothethird-orderrewriteg(u; ϕ) =ek/n(2π) p/2 exp{−r2 (ϕ; u)/2}|j ϕϕ (ˆϕ)| −1/2 ,where ˆϕ =ˆϕ(u) isthemaximumlikelihoodvalueforthetiltedlikelihoodl(ϕ; u) =l 0 (ϕ)+ϕ ′ (u − u 0 ),r 2 (ϕ; u)/2 =l(ˆϕ; u) − l(ϕ; u) istherelatedlog-likelihoodratioquantity,j ϕϕ (ˆϕ) =−∂∂ϕ∂ϕ ′ l(ϕ; u)| ˆϕ(u)is the information matrix at u, and finally k/n is constant to third order. Thedensity approximation g(u; ϕ 0 ) gives an essentially unique third-order nulldistribution (Fraser and Reid, 2013) for testing the parameter value ϕ = ϕ 0 .Then if the parameter ϕ is scalar, we can use standard r ∗ -technology tocalculate the p-value p(ϕ 0 )forassessingϕ = ϕ 0 ;see,e.g.,Brazzaleetal.(2007). For a vector ϕ, adirectedr ∗ departure is available; see, e.g., Davisonet al. (2014). Thus p-values are widely available with high third-order accuracy,


(iii) Testing component parameters. Now consider more generally a component parameter ψ(ϕ) of dimension d with d < p.


that record the effect on y of change in the parameter coordinates θ_1, ..., θ_p; and let V = V(θ̂^0, y^0) = V̂^0 be the observed matrix. Then V records tangents to an intrinsic ancillary contour, say a(y) = a(y^0), that passes through the observed y^0. Thus V represents directions in which the data can be viewed as measuring the parameter, and LV gives the tangent space to the ancillary at the observed data, with V having somewhat the role of a design matrix. For development details, see Fraser and Reid (1995).

From ancillarity it follows that the likelihood conditionally is equal to the full likelihood L^0(θ), to an order one higher than that of the ancillary used. And it also follows that the sample space gradient of the log-likelihood in the directions V along the ancillary contour gives the canonical parameter, viz.

    ϕ(θ) = (∂/∂V) l(θ; y) |_{y^0},

whenever the conditional model is exponential, or gives the canonical parameter of an approximating exponential model otherwise. In either case, l^0(θ) with the preceding ϕ(θ) provides third-order statistical inference for scalar parameters using the saddle point expression and the above technology. And this statistical inference is uniquely determined provided the continuity in the model is required for the inference (Fraser and Rousseau, 2008). For further discussion and details, see Fraser et al. (2010a) and Fraser and Reid (1995).

22.5 Inference for regular models: Bootstrap

Consider a regular statistical model and the exponential approximation as discussed in the preceding section, and suppose we are interested in testing a scalar parameter ψ(ϕ) = ψ_0 with observed data y^0. The bootstrap distribution is f(y; ψ_0, λ̂^0_{ψ_0}), as used in Fraser and Rousseau (2008) from a log-model perspective and then in DiCiccio and Young (2008) for the exponential model case with linear interest parameter.

The ancillary density in the preceding section is third-order free of the nuisance parameter λ. Thus the bootstrap distribution f(y; ψ_0, λ̂^0_{ψ_0}) provides full third-order sampling for this ancillary, equivalent to that from the true sampling f(y; ψ_0, λ): just the use of a different λ value when the distribution is free of λ.

Consider the profile line L^0 through the data point y^0. In developing the ancillary density (22.1), we made use of the presence of ancillary contours cross-sectional to the line L^0. Now suppose we have a d-dimensional quantity t(y, ψ) that provides likelihood centred and scaled departure for ψ, e.g., a signed likelihood root as in Barndorff-Nielsen and Cox (1994) or a Wald quantity, thus providing the choice in DiCiccio and Young (2008). If t(y) is a function of the ancillary, say a(y), then one bootstrap cycle gives third order, a case of direct sampling; otherwise the conditional distribution of y|a also becomes involved and with the likelihood-based t(y) gives third-order inference as in the third cycle of Fraser and Rousseau (2008).
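
The following sketch illustrates one bootstrap cycle of this kind in a simple parametric setting: the interest parameter ψ is a Normal mean, the nuisance standard deviation is fixed at its constrained maximum likelihood value, and t(y) is the signed likelihood root. The model, data values, and choice of statistic are assumptions made only for this illustration.

# Sketch: one parametric bootstrap cycle for testing a scalar interest
# parameter psi (here a Normal mean, with the standard deviation as
# nuisance).  Samples are drawn from f(y; psi0, lambda_hat0_psi0), i.e.,
# with the nuisance parameter fixed at its constrained MLE.
import numpy as np

rng = np.random.default_rng(1)
y0 = np.array([1.2, 0.4, 2.1, 1.7, 0.9, 1.5])   # observed data (made up)
psi0 = 0.5                                       # hypothesized mean
n = len(y0)

def signed_root(y, psi):
    """Signed likelihood root r(psi) for the Normal mean."""
    ybar = y.mean()
    ss = ((y - ybar) ** 2).sum()
    w = n * np.log(1 + n * (ybar - psi) ** 2 / ss)
    return np.sign(ybar - psi) * np.sqrt(w)

t_obs = signed_root(y0, psi0)

# Constrained MLE of the nuisance parameter with psi fixed at psi0
sigma0 = np.sqrt(((y0 - psi0) ** 2).mean())

# One bootstrap cycle: sample from the constrained fit and recompute t
t_boot = np.array([signed_root(rng.normal(psi0, sigma0, n), psi0)
                   for _ in range(20000)])
p_boot = np.mean(t_boot >= t_obs)     # bootstrap p-value for psi = psi0
print(t_obs, p_boot)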


This means that the bootstrap and the usual higher-order calculations are third-order equivalent in some generality, and in reverse that the bootstrap calculations for a likelihood centred and scaled quantity can be viewed as consistent with standard higher-order calculations, although clearly this was not part of the bootstrap design. This equivalence was presented for the linear interest parameter case in an exponential model in DiCiccio and Young (2008), and we now have that it holds widely for regular models with linear or curved interest parameters. For a general regular model, the higher order routinely gives conditioning on full-model ancillary directions while the bootstrap averages over this conditioning.

22.6 Inference for regular models: Bayes

(i) Jeffreys prior. The discussion earlier shows that Bayes validity in general requires data-dependent priors. For the scalar exponential model, however, it was shown by Welch and Peers (1963) that the root information prior of Jeffreys (1946), viz.

    π(θ) = j_{θθ}^{1/2},

provides full second-order validity, and is presented as a globally defined prior and indeed is not data-dependent. The Welch-Peers presentation does use expected information, but with exponential models the observed and expected informations are equivalent. Are such results then available for the vector exponential model?

For the vector regression-scale model, Jeffreys subsequently noted that his root information prior (Jeffreys, 1946) was unsatisfactory and proposed an effective alternative for that model. And for more general contexts, Bernardo (1979) proposed reference posteriors and thus reference priors, based on maximizing the Kullback-Leibler distance between prior and posterior. These priors have some wide acceptance, but can also miss available information.

(ii) The Bayes objective: Likelihood based inference. Another way of viewing Bayesian analysis is as a procedure to extract maximum information from an observed likelihood function L^0(θ). This suggests asymptotic analysis and Taylor expansion about the observed maximum likelihood value θ̂^0. For this we assume a p-dimensional exponential model g(u; ϕ) as expressed in terms of its canonical parameter ϕ and its canonical variable u, either as the given model or as the higher-order approximation mentioned earlier. There are also some presentation advantages in using versions of the parameter and of the variable that give an observed information matrix ĵ^0_{ϕϕ} = I equal to the identity matrix.
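
As a small check of the Welch-Peers result described in (i), the following sketch reuses the exponential-rate setting of the earlier r* sketch: with the Jeffreys prior proportional to 1/θ, the posterior probability attached to θ ≤ θ_0 reproduces the frequentist p-value, and in this particular scalar exponential model the agreement happens to be exact. The numbers are illustrative assumptions.

# Sketch: Welch-Peers / Jeffreys agreement for a scalar exponential model.
# For y_1, ..., y_n iid Exponential(rate theta), the Jeffreys prior is
# proportional to 1/theta, the posterior is Gamma(n, rate = s), and the
# posterior probability P(theta <= theta0 | y) reproduces the frequentist
# p-value P(S <= s; theta0).  Numbers are arbitrary.
from scipy import stats

n, s = 5, 4.0          # sample size and observed sum
theta0 = 2.0           # value of the rate being assessed

bayes = stats.gamma.cdf(theta0, a=n, scale=1 / s)   # posterior P(theta <= theta0)
freq = stats.gamma.cdf(s, a=n, scale=1 / theta0)    # p-value P(S <= s; theta0)
print(bayes, freq)     # both ~0.9004: exact agreement in this model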


(iii) Insightful local coordinates. Now consider the form of the log-model in the neighborhood of the observed data (u^0, ϕ̂^0). And let e be a p-dimensional unit vector that provides a direction from ϕ̂^0 or from u^0. The conditional statistical model along the line u^0 + Le is available from exponential model theory and is just a scalar exponential model with scalar canonical parameter ρ, where ϕ = ϕ̂^0 + ρe is given by polar coordinates. Likelihood theory also shows that the conditional distribution is second-order equivalent to the marginal distribution for assessing ρ. The related prior j_{ρρ}^{1/2} dρ for ρ would use λ = λ̂^0, where λ is the canonical parameter complementing ρ.

(iv) The differential prior. Now suppose the preceding prior j_{ρρ}^{1/2} dρ is used on each line ϕ̂^0 + Le. This composite prior on the full parameter space can be called the differential prior and provides crucial information for Bayes inference. But as such it is of course subject to the well-known limitation on distributions for parameters, both confidence and Bayes; they give incorrect results for curved parameters unless the pivot or prior is targeted on the curved parameter of interest; for details, see, e.g., Dawid et al. (1973) and Fraser (2011).

(v) Location model: Why not use the location property? The appropriate prior for ρ would lead to a constant-information parameterization, which would provide a location relationship near the observed (y^0, ϕ̂^0). As such the p-value for a linear parameter would have a reflected Bayes survivor s-value, thus leading to second order. Such is not a full location model property, just a location property near the data point, but this is all that is needed for the reflected transfer of probability from the sample space to the parameter space, thereby enabling a second-order Bayes calculation.

(vi) Second-order for scalar parameters? But there is more. The conditional distribution for a linear parameter does provide third-order inference and it does use the full likelihood, but that full likelihood needs an adjustment for the conditioning (Fraser and Reid, 2013). It follows that even a linear parameter in an exponential model needs targeting for Bayes inference, and a local or global prior cannot generally yield second-order inference for linear parameters, let alone for the curved parameters as in Dawid et al. (1973) and Fraser (2013).

22.7 The frequency-Bayes contradiction

So is there a frequency-Bayes contradiction? Or a frequency-bootstrap-Bayes contradiction? Not if one respects the continuity widely present in regular statistical models and then requires the continuity to be respected for the frequency calculations and for the choice of Bayes prior.


Frequency theory of course routinely leaves open the choice of pivotal quantity, which provides the basis for tests, confidence bounds, and related intervals and distributions. And Bayes theory leaves open the choice of the prior for extracting information from the likelihood function. And the bootstrap needs a tactical choice of initial statistic to succeed in one bootstrap cycle. Thus on the surface there is a lot of apparent arbitrariness in the usual inference procedures, with a consequent potential for serious contradictions.

In the frequency approach, however, this arbitrariness essentially disappears if continuity of parameter effect in the model is respected, and then required in the inference calculations; see Fraser et al. (2010b) and the discussion in earlier sections. And for the Bayes approach above, the arbitrariness can disappear if the need for data dependence is acknowledged and the locally based differential prior is used to examine sample space probability on the parameter space. This extracts information from the likelihood function to the second order, but just for linear parameters (Fraser, 2013).

The frequency and the bootstrap approaches can succeed without arbitrariness to third order. The Bayes approach can succeed to second order provided the parameter is linear; otherwise the prior needs to target the particular interest parameter. And if distributions are used to describe unknown parameter values, the frequency joins the Bayes in being restricted to linear parameters unless there is targeting; see Dawid et al. (1973) and Fraser (2011).

22.8 Discussion

(i) Scalar case. We began with the simple scalar location case, feeling that clarity should be present at that transparent level if sensible inference was to be available more generally. And we found at point (ii) that there were no Bayes-frequency contradictions in the location model case so long as model continuity was respected and the Bayes s-value was obtained from the location-based prior. Then at point (v) in the general scalar case, we saw that the p-value retains its interpretation as the statistical position of the data and has full repetition validity, but the Bayes requires a prior determined by the form of the model and is typically data dependent. For the scalar model case this is a radical limitation on the Bayes approach; in other words, inverting the distribution function as pivot works immediately for the frequency approach, whereas inverting the likelihood using the conditional probability lemma as a tool requires the prior to reflect the location property, at least locally. For the scalar model context, this represents a full vindication of Fisher (1930), subject to the Neyman (1937) restriction that probabilities be attached only to the inverses of pivot sets.

(ii) Vector case. Most models however involve more than just a scalar parameter. So what about the frequency-Bayes disconnect away from the very simple scalar case?


The Bayes method arose from an unusual original example (Bayes, 1763), where at the analysis stage the parameter was retroactively viewed as generated randomly by a physical process, indeed an earlier performance of the process under study. Thus a real frequency-based prior was introduced hypothetically and became the progenitor for the present Bayes procedure. In due course a prior then evolved as a means for exploring, for inserting feelings, or for technical reasons to achieve analysis when direct methods seemed unavailable. But do we have to make up a prior to avoid admitting that direct methods of analysis were not in obvious abundance?

(iii) Distributions for parameters? Fisher presented the fiducial distribution in Fisher (1930, 1935) and in various subsequent papers. He was criticized from the frequency viewpoint because his proposal left certain things arbitrary and thus not in a fully developed form as expected by the mathematics community at that time: welcome to statistics as a developing discipline! And he was criticized sharply from the Bayes viewpoint (Lindley, 1958) because Fisher proposed distributions for a parameter and such were firmly viewed as Bayes territory. We now have substantial grounds that the exact route to a distribution for a parameter is the Fisher route, and that Bayes becomes an approximation to the Fisher confidence and can even attain second-order validity (Fraser, 2011) but requires targeting even for linear parameters.

But the root problem is that a distribution for a vector parameter is inherently invalid beyond first order (Fraser, 2011). Certainly in some generality with a linear parameter the routine frequency and routine Bayes can agree. But if parameter curvature is allowed, then the frequency p-value and the Bayes s-value change in opposite directions: the p-value retains its validity, having the uniform distribution on the interval (0, 1) property, while the Bayes loses this property and associated validity, yet chooses to retain the label "probability" by discipline commitment, as used from early on. In all the Bayes cases the events receiving probabilities are events in the past, and the prior probability input to the conditional probability lemma is widely there for expediency: the lemma does not create real probabilities from hypothetical probabilities except when there is location equivalence.

(iv) Overview. Most inference contradictions disappear if continuity present in the model is required for the inference calculations. Higher-order frequency and bootstrap are consistent to third order for scalar parameters. Bayes agrees, but just for location parameters, and then to first order for other parameters, and for this Bayes does need a prior that reflects or approximates the location relationship between variable and parameter. Some recent preliminary reports are available at http://www.utstat.toronto.edu/dfraser/documents/ as 260-V3.pdf and 265-V3.pdf.


Acknowledgment

This research was funded in part by the Natural Sciences and Engineering Research Council of Canada, by the Senior Scholars Funding at York University, and by the Department of Statistical Sciences at the University of Toronto. Thanks to C. Genest and A. Wang for help in preparing the manuscript.

References

Abraham, C. (2009). Vioxx took deadly toll: Study. Globe and Mail. http://www.theglobeandmail.com/life/study-finds-vioxx-took-deadly-toll/article4114560/

Barndorff-Nielsen, O.E. and Cox, D.R. (1994). Inference and Asymptotics. Chapman & Hall, London.

Bayes, T. (1763). An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society, London, 53:370-418.

Bergin, C. (2007). Remembering the mistakes of Challenger. nasaspaceflight.com.

Bernardo, J.M. (1979). Reference posterior distributions for Bayesian inference. Journal of the Royal Statistical Society, Series B, 41:113-147.

Box, G.E.P. and Cox, D.R. (1964). An analysis of transformations (with discussion). Journal of the Royal Statistical Society, Series B, 26:211-252.

Brazzale, A.R., Davison, A.C., and Reid, N.M. (2007). Applied Asymptotics. Cambridge University Press, Cambridge, UK.

Dalal, S.R., Fowlkes, E.B., and Hoadley, B. (1989). Risk analysis of the space shuttle: Pre-Challenger prediction of failure. Journal of the American Statistical Association, 84:945-957.

Davison, A.C., Fraser, D.A.S., Reid, N.M., and Sartori, N. (2014). Accurate directional inference for vector parameters. Journal of the American Statistical Association, to appear.

Dawid, A.P., Stone, M., and Zidek, J.V. (1973). Marginalization paradoxes in Bayesian and structural inference. Journal of the Royal Statistical Society, Series B, 35:189-233.


DiCiccio, T.J. and Young, G.A. (2008). Conditional properties of unconditional parametric bootstrap procedures for inference in exponential families. Biometrika, 95:497-504.

Fisher, R.A. (1930). Inverse probability. Proceedings of the Cambridge Philosophical Society, 26:528-535.

Fisher, R.A. (1935). The fiducial argument in statistical inference. Annals of Eugenics, 6:391-398.

Fraser, A.M., Fraser, D.A.S., and Fraser, M.J. (2010a). Parameter curvature revisited and the Bayesian frequentist divergence. Statistical Research: Efron Volume, 44:335-346.

Fraser, A.M., Fraser, D.A.S., and Staicu, A.M. (2010b). Second order ancillary: A differential view with continuity. Bernoulli, 16:1208-1223.

Fraser, D.A.S. (1961a). The fiducial method and invariance. Biometrika, 48:261-280.

Fraser, D.A.S. (1961b). On fiducial inference. The Annals of Mathematical Statistics, 32:661-676.

Fraser, D.A.S. (2011). Is Bayes posterior just quick and dirty confidence? (with discussion). Statistical Science, 26:299-316.

Fraser, D.A.S. (2013). Can Bayes inference be second-order for scalar parameters? Submitted for publication.

Fraser, D.A.S. and Reid, N.M. (1993). Third order asymptotic models: Likelihood functions leading to accurate approximations for distribution functions. Statistica Sinica, 3:67-82.

Fraser, D.A.S. and Reid, N.M. (1995). Ancillaries and third order significance. Utilitas Mathematica, 47:33-53.

Fraser, D.A.S. and Reid, N.M. (2013). Assessing a parameter of interest: Higher-order methods and the bootstrap. Submitted for publication.

Fraser, D.A.S. and Rousseau, J. (2008). Studentization and deriving accurate p-values. Biometrika, 95:1-16.

Jeffreys, H. (1946). An invariant form for the prior probability in estimation problems. Proceedings of the Royal Society, Series A, 186:453-461.

Léger, C. (2013). The Statistical Society of Canada (SSC) response to the NSERC consultation on the evaluation of the Discovery Grants Program. SSC Liaison, 27(2):12-21.

Lindley, D.V. (1958). Fiducial distribution and Bayes theorem. Journal of the Royal Statistical Society, Series B, 20:102-107.


Marshall, M. (2012). Seismologists found guilty of manslaughter. New Scientist, October 22, 2012.

Neyman, J. (1937). Outline of a theory of statistical estimation based on the classical theory of probability. Philosophical Transactions of the Royal Society, Series A, 237:333-380.

Pielke, R. (2011). Lessons of the L'Aquila lawsuit. Bridges, 31. http://www.bbc.co.uk/news/world-europe-20025626.

Prats, J. (2012). The L'Aquila earthquake. Significance, 9:13-16.

Welch, B.L. and Peers, H.W. (1963). On formulae for confidence points based on intervals of weighted likelihoods. Journal of the Royal Statistical Society, Series B, 25:318-329.


23
Conditioning is the issue

James O. Berger
Department of Statistical Science
Duke University, Durham, NC

The importance of conditioning in statistics and its implementation are highlighted through the series of examples that most strongly affected my understanding of the issue. The examples range from "oldies but goodies" to new examples that illustrate the importance of thinking conditionally in modern statistical developments. The enormous potential impact of improved handling of conditioning is also illustrated.

23.1 Introduction

No, this is not about conditioning in the sense of "I was conditioned to be a Bayesian." Indeed I was educated at Cornell University in the early 1970s, by Jack Kiefer, Jack Wolfowitz, Roger Farrell and my advisor Larry Brown, in a strong frequentist tradition, albeit with heavy use of prior distributions as technical tools. My early work on shrinkage estimation got me thinking more about the Bayesian perspective; doesn't one need to decide where to shrink, and how can that decision not require Bayesian thinking? But it wasn't until I encountered statistical conditioning (see the next section if you do not know what that means) and the Likelihood Principle that I suddenly felt like I had woken up and was beginning to understand the foundations of statistics.

Bayesian analysis, because it is completely conditional (depending on the statistical model only through the observed data), automatically conditions properly and, hence, has been the focus of much of my work. But I never stopped being a frequentist and came to understand that frequentists can also appropriately condition. Not surprisingly (in retrospect, but not at the time) I found that, when frequentists do appropriately condition, they obtain answers remarkably like the Bayesian answers; this, in my mind, makes conditioning the key issue in the foundations of statistics, as it unifies the two major perspectives of statistics.


The practical importance of conditioning arises because, when it is not done in certain scenarios, the results can be very detrimental to science. Unfortunately, this is the case for many of the most commonly used statistical procedures, as will be discussed.

This chapter is a brief tour of old and new examples that most influenced me over the years concerning the need to appropriately condition. The new ones include performing a sequence of tests, as is now common in clinical trials and is being done badly, and an example involving a type of false discovery rate.

23.2 Cox example and a pedagogical example

As this is more of an account of my own experiences with conditioning, I have not tried to track down when the notion first arose. Pierre Simon de Laplace likely understood the issue, as he spent much of his career as a Bayesian in dealing with applied problems and then, later in life, also developed frequentist inference. Clearly Ronald Fisher and Harold Jeffreys knew all about conditioning early on. My first introduction to conditioning was the example of Cox (1958).

A variant of the Cox example: Every day an employee enters a lab to perform assays, and is assigned an unbiased instrument to perform the assays. Half of the available instruments are new and have a small variance of 1, while the other half are old and have a variance of 3. The employee is assigned each type with probability 1/2, and knows whether the instrument is old or new.

Conditional inference: For each assay, report variance 1 or 3, depending on whether a new or an old instrument is being used.

Unconditional inference: The overall variance of the assays is .5 × 1 + .5 × 3 = 2, so report a variance of 2 always.

It seems silly to do the unconditional inference here, especially when noting that the conditional inference is also fully frequentist; in the latter, one is just choosing a different subset of events over which to do a long run average.

The Cox example contains the essence of conditioning, but tends to be dismissed because of the issue of "global frequentism." The completely pure frequentist position is that one's entire life is a huge experiment, and so the correct frequentist average is over all possibilities in all situations involving uncertainty that one encounters in life. As this is clearly impossible, frequentists have historically chosen to condition on the experiment actually being conducted before applying frequentist logic; then Cox's example would seem irrelevant. However, the virtually identical issue can arise within an experiment, as demonstrated in the following example, first appearing in Berger and Wolpert (1984).
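
Before turning to that example, here is a quick numerical check of the variance reporting in the Cox variant above; the simulation setup (Normal assay errors, a large number of days) is an assumption made only for this sketch.

# Sketch: the Cox instrument example.  Conditional reporting (variance 1
# or 3, according to the instrument actually assigned) is itself fully
# frequentist: each report is the long-run variance over the subset of
# days with that instrument, while 2 is the average over all days.
import numpy as np

rng = np.random.default_rng(0)
days = 1_000_000
new_instrument = rng.random(days) < 0.5        # True: variance 1, False: variance 3
sd = np.where(new_instrument, 1.0, np.sqrt(3.0))
errors = rng.normal(0.0, sd)                   # assay errors about the true value

print(errors.var())                            # ~2.0  (unconditional)
print(errors[new_instrument].var())            # ~1.0  (given a new instrument)
print(errors[~new_instrument].var())           # ~3.0  (given an old instrument)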


Pedagogical example: Two observations, X_1 and X_2, are to be taken, where

    X_i = θ + 1 with probability 1/2,
          θ − 1 with probability 1/2.

Consider the confidence set for the unknown θ ∈ ℝ:

    C(X_1, X_2) = the singleton {(X_1 + X_2)/2}  if X_1 ≠ X_2,
                  the singleton {X_1 − 1}        if X_1 = X_2.

The frequentist coverage of this confidence set is

    P_θ{C(X_1, X_2) contains θ} = .75,

which is not at all a sensible report once the data is at hand. Indeed, if x_1 ≠ x_2, then we know for sure that (x_1 + x_2)/2 is equal to θ, so that the confidence set is then actually 100% accurate. On the other hand, if x_1 = x_2, we do not know whether θ is the data's common value plus 1 or their common value minus 1, and each of these possibilities is equally likely to have occurred; the confidence interval is then only 50% accurate. While it is not wrong to say that the confidence interval has 75% coverage, it is obviously much more scientifically useful to report 100% or 50%, depending on the data. And again, this conditional report is still fully frequentist, averaging over the sets of data {(x_1, x_2) : x_1 ≠ x_2} and {(x_1, x_2) : x_1 = x_2}, respectively.

23.3 Likelihood and stopping rule principles

Suppose an experiment E is conducted, which consists of observing data X having density f(x|θ), where θ denotes the unknown parameters of the statistical model. Let x_obs denote the data actually observed.

Likelihood Principle (LP): The information about θ, arising from just E and x_obs, is contained in the observed likelihood function

    L(θ) = f(x_obs|θ).

Furthermore, if two observed likelihood functions are proportional, then they contain the same information about θ.

The LP is quite controversial, in that it effectively precludes use of frequentist measures, which all involve averages of f(x|θ) over x that are not observed. Bayesians automatically follow the LP because the posterior distribution of θ follows from Bayes' theorem (with p(θ) being the prior density for θ) as

    p(θ|x_obs) = p(θ) f(x_obs|θ) / ∫ p(θ′) f(x_obs|θ′) dθ′,

which clearly depends on E and x_obs only through the observed likelihood function.


There was not much attention paid to the LP by non-Bayesians, however, until the remarkable paper of Birnbaum (1962), which deduced the LP as a consequence of the conditionality principle (essentially the Cox example, saying that one should base the inference on the measuring instrument actually used) and the sufficiency principle, which states that a sufficient statistic for θ in E contains all information about θ that is available from the experiment. At the time of Birnbaum's paper, almost everyone agreed with the conditionality principle and the sufficiency principle, so it was a shock that the LP was a direct consequence of the two. The paper had a profound effect on my own thinking.

There are numerous clarifications and qualifications relevant to the LP, and various generalizations and implications. Many of these (and the history of the LP) are summarized in Berger and Wolpert (1984). Without going further, suffice it to say that the LP is, at a minimum, a very powerful argument for conditioning.

Stopping Rule Principle (SRP): The reasons for stopping experimentation have no bearing on the information about θ arising from E and x_obs.

The SRP is actually an immediate consequence of the second part of the LP, since "stopping rules" affect L(θ) only by multiplicative constants. Serious discussion of the SRP goes back at least to Barnard (1947), who wondered why thoughts in the experimenter's head concerning why to stop an experiment should affect how we analyze the actual data that were obtained.

Frequentists typically violate the SRP. In clinical trials, for instance, it is standard to "spend α" for looks at the data — i.e., if there are to be interim analyses during the trial, with the option of stopping the trial early should the data look convincing, frequentists view it to be mandatory to adjust the allowed error probability (down) to account for the multiple analyses.

In Berger and Berry (1988), there is extensive discussion of these issues, with earlier references. The complexity of the issue was illustrated by a comment of Jimmy Savage:

"I learned the stopping rule principle from Professor Barnard, in conversation in the summer of 1952. Frankly, I then thought it a scandal that anyone in the profession could advance an idea so patently wrong, even as today I can scarcely believe that people resist an idea so patently right." (Savage et al., 1962)


The SRP does not say that one is free to ignore the stopping rule in any statistical analysis. For instance, common practice in some sciences, when testing a null hypothesis H_0, is to continue collecting data until the p-value satisfies p < 0.05.


a thought experiment. What is done in practice is to use the confidence procedure on a series of different problems — not a series of imaginary repetitions of the same problem with different data.

Neyman himself often pointed out that the motivation for the frequentist principle is in its use on differing real problems; see, e.g., Neyman (1977). Of course, the reason textbooks typically give the imaginary repetition of an experiment version is because of the mathematical fact that if, say, a confidence procedure has 95% frequentist coverage for each fixed parameter value, then it will necessarily also have 95% coverage when used repeatedly on a series of differing problems. And, if the coverage is not constant over each fixed parameter value, one can always find the minimum coverage over the parameter space, since it will follow that the real frequentist coverage in repeated use of the procedure on real problems will never be worse than this minimum coverage.

Pedagogical example continued: Reporting 50% and 100% confidence, as appropriate, is fully frequentist, in that the long run reported coverage will average .75, which is the long run actual coverage.

p-values: p-values are not frequentist measures of evidence in any long run average sense. Suppose we observe X, have a null hypothesis H_0, and construct a proper p-value p(X). Viewing the observed p(x_obs) as a conditional error rate when rejecting H_0 is not correct from the frequentist perspective. To see this, note that, under the null hypothesis, a proper p-value will be Uniform on the interval (0, 1), so that if rejection occurs when p(X) ≤ α, the average reported p-value under H_0 and rejection will be

    E[p(X) | H_0, {p(·) ≤ α}] = ∫_0^α p (1/α) dp = α/2,

which is only half the actual long run error α. There have been other efforts to give a real frequentist interpretation of a p-value, none of them successful in terms of the definition at the beginning of this section. Note that the procedure {reject H_0 when p(X) ≤ α} is a fully correct frequentist procedure, but the stated error rate in rejection must be α, not the p-value.

There have certainly been other ways of defining frequentism; see, e.g., Mayo (1996) for discussion. However, it is only the version given at the beginning of the section that strikes me as being compelling. How could one want to give statistical inferences that, over the long run, systematically distort their associated accuracies?
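
As a quick check of the p-value claim just made, the following simulation generates p-values under H_0 for a one-sided Normal test (an arbitrary setup assumed only for illustration) and averages those that fall in the rejection region; the average is close to α/2 = .025 rather than α = .05.

# Sketch: under H0 a proper p-value is Uniform(0,1), so its average over
# the rejection region {p <= alpha} is alpha/2, not alpha.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
alpha, n, reps = 0.05, 20, 200_000

x = rng.normal(0.0, 1.0, size=(reps, n))      # data generated under H0: mu = 0
z = x.mean(axis=1) * np.sqrt(n)               # test statistic
p = 1 - stats.norm.cdf(z)                     # one-sided p-values
print(p[p <= alpha].mean())                   # ~0.025 = alpha/2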


23.5 Conditional frequentist inference

23.5.1 Introduction

The theory of combining the frequentist principle with conditioning was formalized by Kiefer in Kiefer (1977), although there were many precursors to the theory initiated by Fisher and others. There are several versions of the theory, but the most useful has been to begin by defining a conditioning statistic S which measures the "strength of evidence" in the data. Then one computes the desired frequentist measure, but does so conditional on the strength of evidence S.

Pedagogical example continued: S = |X_1 − X_2| is the obvious choice, S = 2 reflecting data with maximal evidential content (corresponding to the situation of 100% confidence) and S = 0 being data of minimal evidential content. Here coverage probability is the desired frequentist criterion, and an easy computation shows that conditional coverage given S is given by

    P_θ{C(X_1, X_2) contains θ | S = 2} = 1,
    P_θ{C(X_1, X_2) contains θ | S = 0} = 1/2,

for the two distinct cases, which are the intuitively correct answers.

23.5.2 Ancillary statistics and invariant models

An ancillary statistic is a statistic S whose distribution does not depend on unknown model parameters θ. In the pedagogical example, S = 0 and S = 2 have probability 1/2 each, independent of θ, and so S is ancillary. When ancillary statistics exist, they are usually good measures of the strength of evidence in the data, and hence provide good candidates for conditional frequentist inference.

The most important situations involving ancillary statistics arise when the model has what is called a group-invariance structure; cf. Berger (1985) and Eaton (1989). When this structure is present, the best ancillary statistic to use is what is called the maximal invariant statistic. Doing conditional frequentist inference with the maximal invariant statistic is then equivalent to performing Bayesian inference with the right-Haar prior distribution with respect to the group action; cf. Berger (1985), Eaton (1989), and Stein (1965).

Example - Location Distributions: Suppose X_1, ..., X_n form a random sample from the location density f(x_i − θ). This model is invariant under the group operation defined by adding any constant to each observation and θ; the maximal invariant statistic (in general) is S = (x_2 − x_1, x_3 − x_1, ..., x_n − x_1), and performing conditional frequentist inference, conditional on S, will give the same numerical answers as performing Bayesian inference with the right-Haar prior, here simply given by p(θ) = 1.


For instance, the optimal conditional frequentist estimator of θ under squared error loss would simply be the posterior mean with respect to p(θ) = 1, namely

    θ̂ = ∫ θ ∏_{i=1}^n f(x_i − θ) dθ / ∫ ∏_{i=1}^n f(x_i − θ) dθ,

which is also known as Pitman's estimator.

Having a model with a group-invariance structure leaves one in an incredibly powerful situation, and this happens with many of our most common statistical problems (mostly from an estimation perspective). The difficulties of the conditional frequentist perspective are (i) finding the right strength of evidence statistic S, and (ii) carrying out the conditional frequentist computation. But, if one has a group-invariant model, these difficulties can be bypassed because theory says that the optimal conditional frequentist answer is the answer obtained from the much simpler Bayesian analysis with the right-Haar prior.

Note that the conditional frequentist and Bayesian answers will have different interpretations. For instance, both approaches would produce the same 95% confidence set, but the conditional frequentist would say that the frequentist coverage, conditional on S (and also unconditionally), is 95%, while the Bayesian would say the set has probability .95 of actually containing θ. Also note that it is not automatically true that analysis conditional on ancillary statistics is optimal; see, e.g., Brown (1990).

23.5.3 Conditional frequentist testing

Upon rejection of H_0 in unconditional Neyman-Pearson testing, one reports the same error probability α regardless of where the test statistic is in the rejection region. This has been viewed as problematical by many, and is one of the main reasons for the popularity of p-values. But as we saw earlier, p-values do not satisfy the frequentist principle, and so are not the conditional frequentist answer.

A true conditional frequentist solution to the problem was proposed in Berger et al. (1994), with modification (given below) from Sellke et al. (2001) and Wolpert (1996). Suppose that we wish to test that the data X arise from the simple (i.e., completely specified) hypotheses H_0: f = f_0 or H_1: f = f_1. The recommended strength of evidence statistic is

    S = max{p_0(x), p_1(x)},

where p_0(x) is the p-value when testing H_0 versus H_1, and p_1(x) is the p-value when testing H_1 versus H_0. It is generally agreed that smaller p-values correspond to more evidence against an hypothesis, so this use of p-values in determining the strength of evidence statistic is natural.


The frequentist conditional error probabilities (CEPs) are computed as

    α(s) = Pr(Type I error | S = s) ≡ P_0{reject H_0 | S(X) = s},
    β(s) = Pr(Type II error | S = s) ≡ P_1{accept H_0 | S(X) = s},            (23.1)

where P_0 and P_1 refer to probability under H_0 and H_1, respectively. The corresponding conditional frequentist test is then

    If p_0 ≤ p_1, reject H_0 and report Type I CEP α(s);
    If p_0 > p_1, accept H_0 and report Type II CEP β(s);                     (23.2)

where the CEPs are given in (23.1).

These conditional error probabilities are fully frequentist and vary over the rejection region as one would expect. In a sense, this procedure can be viewed as a way to turn p-values into actual error probabilities.

It was mentioned in the introduction that, when a good conditional frequentist procedure has been found, it often turns out to be numerically equivalent to a Bayesian procedure. That is the case here. Indeed, Berger et al. (1994) shows that

    α(s) = Pr(H_0|x),    β(s) = Pr(H_1|x),                                    (23.3)

where Pr(H_0|x) and Pr(H_1|x) are the Bayesian posterior probabilities of H_0 and H_1, respectively, assuming the hypotheses have equal prior probabilities of 1/2. Therefore, a conditional frequentist can simply compute the objective Bayesian posterior probabilities of the hypotheses, and declare that they are the conditional frequentist error probabilities; there is no need to formally derive the conditioning statistic or perform the conditional frequentist computations. There are many generalizations of this beyond the simple versus simple testing.

The practical import of switching to conditional frequentist testing (or the equivalent objective Bayesian testing) is startling. For instance, Sellke et al. (2001) uses a nonparametric setting to develop the following very general lower bound on α(s), for a given p-value:

    α(s) ≥ ( 1 + [−e p ln(p)]^{−1} )^{−1}.                                    (23.4)

Some values of this lower bound for common p-values are given in Table 23.1. Thus p = .05, which many erroneously think implies strong evidence against H_0, actually corresponds to a conditional frequentist error probability at least as large as .289, which is a rather large error probability. If scientists understood that a p-value of .05 corresponded to that large a potential error probability in rejection, the scientific world would be a quite different place.
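
The bound (23.4) is simple to evaluate; the sketch below computes it at the p-values used in Table 23.1 and closely matches the tabulated values.

# Sketch: evaluate the lower bound (23.4) on the conditional Type I error
# probability alpha(s) implied by an observed p-value.
import numpy as np

def cep_lower_bound(p):
    """Lower bound (1 + [-e * p * ln p]^(-1))^(-1) on alpha(s)."""
    return 1.0 / (1.0 + 1.0 / (-np.e * p * np.log(p)))

for p in [0.2, 0.1, 0.05, 0.01, 0.005, 0.001, 0.0001, 0.00001]:
    print(p, round(cep_lower_bound(p), 4))
# p = 0.05 gives ~0.289: far larger than the nominal 5% error rate.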


TABLE 23.1
Values of the lower bound α(s) in (23.4) for various values of p.

    p       .2     .1     .05    .01    .005   .001    .0001   .00001
    α(s)    .465   .385   .289   .111   .067   .0184   .0025   .00031

23.5.4 Testing a sequence of hypotheses

It is common in clinical trials to test multiple endpoints but to do so sequentially, only considering the next hypothesis if the previous hypothesis was a rejection of the null. For instance, the primary endpoint for a drug might be weight reduction, with the secondary endpoint being reduction in an allergic reaction. (Typically, these will be more biologically related endpoints, but the point here is better made when the endpoints have little to do with each other.) Denote the primary endpoint (null hypothesis) by H_0^1, and the statistical analysis must first test this hypothesis. If the hypothesis is not rejected at level α, the analysis stops — i.e., no further hypotheses can be considered. However, if the hypothesis is rejected, one can go on and consider the secondary endpoint, defined by null hypothesis H_0^2. Suppose this hypothesis is also rejected at level α.

Surprisingly, the overall probability of Type I error (rejecting at least one true null hypothesis) for this procedure is still just α — see, e.g., Hsu and Berger (1999) — even though there is the possibility of rejecting two separate hypotheses. It appears that the second test comes "for free," with rejection allowing one to claim two discoveries for the price of one. This actually seems too remarkable; how can we be as confident that both rejections are correct as we are that just the first rejection is correct?

If this latter intuition is not clear, note that one does not need to stop after two hypotheses. If the second has rejected, one can test H_0^3 and, if that is rejected at level α, one can go on to test a fourth hypothesis H_0^4, etc. Suppose one follows this procedure and has rejected H_0^1, ..., H_0^10. It is still true that the probability of Type I error for the procedure — i.e., the probability that the procedure will result in an erroneous rejection — is just α. But it seems ridiculous to think that there is only probability α that at least one of the 10 rejections is incorrect. (Or imagine a million rejections in a row, if you do not find the argument for 10 convincing.)

The problem here is in the use of the unconditional Type I error to judge accuracy. Before starting the sequence of tests, the probability that the procedure yields at least one incorrect rejection is indeed α, but the situation changes dramatically as we start down the path of rejections. The simplest way to see this is to view the situation from the Bayesian perspective. Consider the situation in which all the hypotheses can be viewed as a priori independent (i.e., knowing that one is true or false does not affect perceptions of the others). If x is the overall data from the trial, and a total of m tests are ultimately conducted by the procedure, all claimed to be rejections (i.e., all claimed to correspond to the H_0^i being false), the Bayesian computes

    Pr(at least one incorrect rejection | x) = 1 − Pr(no incorrect rejections | x)
                                             = 1 − ∏_{i=1}^m {1 − Pr(H_0^i | x)},        (23.5)

where Pr(H_0^i | x) is the posterior probability that H_0^i is true given the data. Clearly, as m grows, (23.5) will go to 1 so that, if there are enough tests, the Bayesian becomes essentially sure that at least one of the rejections was wrong. From Section 23.5.3, recall that Bayesian testing can be exactly equivalent to conditional frequentist testing, so it should be possible to construct a conditional frequentist variant of (23.5). This will, however, be pursued elsewhere.

While we assumed that the hypotheses are all a priori independent, it is more typical in the multiple endpoint scenario that they will be a priori related (e.g., different dosages of a drug). This can be handled within the Bayesian approach (and will be explored elsewhere), but it is not clear how a frequentist could incorporate this information, since it is information about the prior probabilities of hypotheses.
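
To make (23.5) concrete, here is a small sketch that evaluates the chance of at least one wrong rejection as the number of rejected hypotheses grows; the individual posterior probabilities Pr(H_0^i | x) are invented values used only for illustration.

# Sketch: equation (23.5).  Even if every individual rejection looks
# fairly convincing, the chance that at least one rejection in a long
# sequence is wrong grows quickly with the number of tests m.
# The posterior probabilities Pr(H0_i | x) below are invented values.
import numpy as np

post_H0 = np.array([0.05, 0.10, 0.08, 0.20, 0.03, 0.15, 0.12, 0.07, 0.25, 0.10])

for m in (1, 2, 5, 10):
    p_wrong = 1 - np.prod(1 - post_H0[:m])
    print(m, round(p_wrong, 3))
# m = 1 -> 0.05, but m = 10 -> ~0.71: almost surely at least one mistake.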


23.5.5 True to false discovery odds

A very important paper in the history of genome-wide association studies (the effort to find which genes are associated with certain diseases) was Burton et al. (2007). Consider testing H_0: θ = 0 versus an alternative H_1: θ ≠ 0, with rejection region R and corresponding Type I and Type II errors α and β(θ). Let p(θ) be the prior density of θ under H_1, and define the average power

    1 − β̄ = ∫ {1 − β(θ)} p(θ) dθ.

Frequentists would typically just pick some value θ* at which to evaluate the power; this is equivalent to choosing p(θ) to be a point mass at θ*.

The paper observed that, pre-experimentally, the odds of correctly rejecting H_0 to incorrectly rejecting are

    O_pre = (π_1/π_0) × (1 − β̄)/α,                                            (23.6)

where π_0 and π_1 = 1 − π_0 are the prior probabilities of H_0 and H_1. The corresponding false discovery rate would be (1 + O_pre)^{−1}.

The paper went on to assess the prior odds π_1/π_0 of a genome/disease association to be 1/100,000, and estimated the average power of a GWAS test to be .5. It was decided that a discovery should be reported if O_pre ≥ 10, which from (23.6) would require α ≤ 5 × 10^{−7}; this became the recommended standard for significance in GWAS studies. Using this standard for a large data set, the paper found 21 genome/disease associations, virtually all of which have been subsequently verified.

An alternative approach that was discussed in the paper is to use the posterior odds rather than pre-experimental odds — i.e., to condition. The posterior odds are

    O_post(x) = (π_1/π_0) × m(x|H_1)/f(x|0),                                   (23.7)

where m(x|H_1) = ∫ f(x|θ) p(θ) dθ is the marginal likelihood of the data x under H_1. (Again, this prior could be a point mass at θ* in a frequentist setting.) It was noted in the paper that the posterior odds for the 21 claimed associations ranged between 1/10 (i.e., evidence against the association being true) and 10^{68} (overwhelming evidence in favor of the association). It would seem that these conditional odds, based on the actual data, are much more scientifically informative than the fixed pre-experimental odds of 10/1 for the chosen α, but the paper did not ultimately recommend their use because it was felt that a frequentist justification was needed.

Actually, use of O_post is as fully frequentist as is use of O_pre, since it is trivial to show that E{O_post(x) | H_0, R} = O_pre, i.e., the average of the conditional reported odds equals the actual pre-experimental reported odds, which is all that is needed to be fully frequentist. So one can have the much more scientifically useful conditional report, while maintaining full frequentist justification. This is yet another case where, upon getting the conditioning right, a frequentist completely agrees with a Bayesian.
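
The pre-experimental calculation reported above is easy to reproduce, and the conditional report (23.7) is equally simple once a prior under H_1 is specified; in the sketch below the point-mass alternative and the observed z-value are invented purely for illustration.

# Sketch: pre-experimental odds (23.6) with the Burton et al. (2007)
# inputs, plus posterior odds (23.7) for an illustrative z-statistic
# with a point-mass alternative at theta* (an assumption of this sketch).
import numpy as np
from scipy import stats

prior_odds = 1 / 100_000        # pi_1 / pi_0 for a genome/disease association
power, alpha = 0.5, 5e-7
O_pre = prior_odds * power / alpha
print(O_pre)                    # 10.0, the paper's reporting threshold

# Posterior odds for an observed z, with H0: N(0,1) and H1: N(theta*, 1)
theta_star, z = 5.5, 6.2        # invented values
O_post = prior_odds * stats.norm.pdf(z, loc=theta_star) / stats.norm.pdf(z)
print(O_post)                   # data this extreme overwhelm the tiny prior odds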


23.6 Final comments

Lots of bad science is being done because of a lack of recognition of the importance of conditioning in statistics. Overwhelmingly at the top of the list is the use of p-values and acting as if they are actually error probabilities. The common approach to testing a sequence of hypotheses is a new addition to the list of bad science because of a lack of conditioning. The use of pre-experimental odds rather than posterior odds in GWAS studies is not so much bad science as a failure to recognize a conditional frequentist opportunity that is available to improve science. Violation of the stopping rule principle in sequential (or interim) analysis is in a funny position. While it is generally suboptimal (for instance, one could do conditional frequentist testing instead), it may be necessary if one is committed to certain inferential procedures such as fixed Type I error probabilities. (In other words, one mistake may require the incorporation of another mistake.)


How does a frequentist know when a serious conditioning mistake is being made? We have seen a number of situations where it is clear but, in general, there is only one way to identify if conditioning is an issue — Bayesian analysis. If one can find a Bayesian analysis for a reasonable prior that yields the same answer as the frequentist analysis, then there is probably not a conditioning issue; otherwise, the conflicting answers are probably due to the need for conditioning on the frequentist side.

The most problematic situations (and unfortunately there are many) are those for which there exists an apparently sensible unconditional frequentist analysis but Bayesian analysis is unavailable or too difficult to implement given available resources. There is then not much choice but to use the unconditional frequentist analysis, but one might be doing something silly because of not being able to condition and one will not know. The situation is somewhat comparable to seeing the report of a Bayesian analysis but not having access to the prior distribution.

While I have enjoyed reminiscing about conditioning, I remain as perplexed today as 35 years ago when I first learned about the issue: why do we still not treat conditioning as one of the most central issues in statistics?

References

Barnard, G.A. (1947). A review of 'Sequential Analysis' by Abraham Wald. Journal of the American Statistical Association, 42:658-669.

Berger, J.O. (1985). Statistical Decision Theory and Bayesian Analysis. Springer, New York.

Berger, J.O. and Berry, D.A. (1988). The relevance of stopping rules in statistical inference. In Statistical Decision Theory and Related Topics IV, 1. Springer, New York, pp. 29-47.

Berger, J.O., Boukai, B., and Wang, Y. (1999). Simultaneous Bayesian-frequentist sequential testing of nested hypotheses. Biometrika, 86:79-92.

Berger, J.O., Brown, L.D., and Wolpert, R.L. (1994). A unified conditional frequentist and Bayesian test for fixed and sequential simple hypothesis testing. The Annals of Statistics, 22:1787-1807.

Berger, J.O. and Wolpert, R.L. (1984). The Likelihood Principle. IMS Lecture Notes, Monograph Series, 6. Institute of Mathematical Statistics, Hayward, CA.

Berkson, J. (1938). Some difficulties of interpretation encountered in the application of the chi-square test. Journal of the American Statistical Association, 33:526-536.


Birnbaum, A. (1962). On the foundations of statistical inference. Journal of the American Statistical Association, 57:269-306.

Brown, L.D. (1990). An ancillarity paradox which appears in multiple linear regression. The Annals of Statistics, 18:471-493.

Burton, P.R., Clayton, D.G., Cardon, L.R., Craddock, N., Deloukas, P., Duncanson, A., Kwiatkowski, D.P., McCarthy, M.I., Ouwehand, W.H., Samani, N.J., et al. (2007). Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature, 447:661-678.

Cox, D.R. (1958). Some problems connected with statistical inference. The Annals of Mathematical Statistics, 29:357-372.

Eaton, M.L. (1989). Group Invariance Applications in Statistics. Institute of Mathematical Statistics, Hayward, CA.

Hsu, J.C. and Berger, R.L. (1999). Stepwise confidence intervals without multiplicity adjustment for dose-response and toxicity studies. Journal of the American Statistical Association, 94:468-482.

Kiefer, J. (1977). Conditional confidence statements and confidence estimators. Journal of the American Statistical Association, 72:789-808.

Mayo, D.G. (1996). Error and the Growth of Experimental Knowledge. University of Chicago Press, Chicago, IL.

Neyman, J. (1977). Frequentist probability and frequentist statistics. Synthese, 36:97-131.

Savage, L.J., Barnard, G., Cornfield, J., Bross, I., Box, G.E.P., Good, I.J., Lindley, D.V., Clunies-Ross, C.W., Pratt, J.W., Levene, H., et al. (1962). On the foundations of statistical inference: Discussion. Journal of the American Statistical Association, 57:307-326.

Sellke, T., Bayarri, M., and Berger, J.O. (2001). Calibration of p-values for testing precise null hypotheses. The American Statistician, 55:62-71.

Stein, C. (1965). Approximation of improper prior measures by prior probability measures. In Bernoulli-Bayes-Laplace Festschrift. Springer, New York, pp. 217-240.

Wolpert, R.L. (1996). Testing simple hypotheses. In Studies in Classification, Data Analysis, and Knowledge Organization, Vol. 7. Springer, New York, pp. 289-297.


24
Statistical inference from a Dempster–Shafer perspective

Arthur P. Dempster
Department of Statistics
Harvard University, Cambridge, MA

24.1 Introduction

What follows is a sketch of my 2013 viewpoint on how statistical inference should be viewed by applied statisticians. The label DS is an acronym for "Dempster–Shafer" after the originators of the technical foundation of the theory. Our foundation remains essentially unchanged since the 1960s and 1970s when I and then Glenn Shafer were its initial expositors.

Present issues concern why and how the theory has the potential to develop into a major competitor of the "frequentist" and "Bayesian" outlooks. This for me is a work in progress. My understanding has evolved substantially over the past eight years of my emeritus status, during which DS has been my major focus. It was also a major focus of mine over the eight years beginning in 1961 when I first had the freedom that came with academic tenure in the Harvard Statistics Department. Between the two periods I was more an observer and teacher in relation to DS than a primary developer. I do not attempt here to address the long history of how DS got to where I now understand it to be, including connections with R.A. Fisher's controversial "fiducial" argument.

DS draws on technical developments in fields such as stochastic modeling and Bayesian posterior computation, but my DS-guided perception of the nature of statistical inference is in different ways both narrower and broader than that of its established competitors. It is narrower because it maintains that what "frequentist" statisticians call "inference" is not inference in the natural language meaning of the word. The latter means to me direct situation-specific assessments of probabilistic uncertainties that I call "personal probabilities." For example, I might predict on September 30, 2013 that with personal probability .31 the Dow Jones Industrials stock index will exceed 16,000 at the end of business on December 31, 2013.


From the DS perspective, statistical prediction, estimation, and significance testing depend on understanding and accepting the DS logical framework, as implemented through model-based computations that mix probabilistic and deterministic logic. They do not depend on frequentist properties of hypothetical ("imagined") long runs of repeated application of any defined repeatable statistical procedure, which properties are simply mathematical statements about the procedure. Knowledge of such long run properties may guide choosing among statistical procedures, but drawing conclusions from a specific application of a chosen procedure is something else again.

Whereas the logical framework of DS inference has long been defined and stable, and presumably will not change, the choice of a model to be processed through the logic must be determined by a user or user community in each specific application. It has long been known that DS logic subsumes Bayesian logic. A Bayesian instantiation of DS inference occurs automatically within the DS framework when a Bayesian model is adopted by a user. My argument for the importance of DS logic is not primarily that it encapsulates Bayes, however, but is that it makes available important classes of models and associated inferences that narrower Bayesian models are unable to represent. Specifically, it provides models where personal probabilities of "don't know" are appropriately introduced. In particular, Bayesian "priors" become optional in many common statistical situations, especially when DS probabilities of "don't know" are allowed. Extending Bayesian thinking in this way promises greater realism in many or even most applied situations.

24.2 Personal probability

DS probabilities can be studied from a purely mathematical standpoint, but when they have a role in assessing uncertainties about specific real world unknowns, they are meant for interpretation as "personal" probabilities. To my knowledge, the term "personal" was first used in relation to mathematical probabilities by Émile Borel in a book (Borel, 1939), and then in statistics by Jimmie Savage, as far back as 1950, and subsequently in many short contributions preceding his untimely death in 1971. Savage was primarily concerned with Bayesian decision theory, wherein proposed actions are based on posterior expectations. From a DS viewpoint, the presence of decision components is optional.

The DS inference paradigm explicitly recognizes the role of a user in constructing and using formal models that represent his or her uncertainty. No other role for the application of probabilities is recognized. Ordinary speech often describes empirical variation as "random," and statisticians often regard probabilities as mathematical representations of "randomness," which they are, except that in most if not all of statistical practice "random" variation is simply unexplained variation whose associated probabilities are quite properly interpretable as personal numerical assessments of specific targeted uncertainties.


Such models inevitably run a gamut from objective to subjective, and from broadly accepted to proposed and adopted by a single analyst who becomes the "person" in a personalist narrative. Good statistical practice aims at the left ends of both scales, while actual practice necessarily makes compromises. As Jack Good used to say, "inference is possible."

The concept of personalist interpretation of specific probabilities is usually well understood by statisticians, but is mostly kept hidden as possibly unscientific. Nevertheless, all approaches to statistical inference imply the exercise of mature judgment in the construction and use of formal models that integrate descriptions of empirical phenomena with prescriptions for reasoning under uncertainty. By limiting attention to long run averages, "frequentist" interpretations are designed to remove any real or imagined taint from personal probabilities, but paradoxically do not remove the presence of nonprobabilistic reasoning about deterministic long runs. The latter reasoning is just as personalist as the former. Why the fear of reasoning with personal probabilities, but not a similar fear of ordinary propositional logic? This makes no sense to me, if the goal is to remove any role for a "person" performing logical analysis in establishing valid scientific findings.

I believe that, as partners in scientific inquiry, applied statisticians should seek credible models directly aimed at uncertainties through precisely formulated direct and transparent reasoning with personal probabilities. I argue that DS logic is at present the best available system for doing this.

24.3 Personal probabilities of "don't know"

DS "gives up something big," as John Tukey once described it to me, or as I now prefer to describe it, by modifying the root concepts of personal probabilities "for" and "against" that sum to one, by appending a third personal probability of "don't know." The extra term adds substantially to the flexibility of modeling, and to the expressiveness of inputs and outputs of DS probabilistic reasoning.

The targets of DS inference are binary outcomes, or equivalent assertions that the true state of some identified small world is either in one subset of the full set of possible true states, or in the complementary subset. Under what I shall refer to as the "ordinary" calculus of probability (OCP), the user is required to supply a pair of non-negative probabilities summing to one that characterize "your" uncertainty about which subset contains the true state. DS requires instead that "you" adopt an "extended" calculus of probability (ECP) wherein the traditional pair of probabilities, that the true state lies or does not lie in the subset associated with a targeted outcome, is supplemented by a third probability of "don't know," where all three probabilities are non-negative and sum to one.


It needs to be emphasized that the small world of possible true states is characterized by binary outcomes interpretable as true/false assertions, while uncertainties about such two-valued outcomes are represented by three-valued probabilities.

For a given targeted outcome, a convenient notation for a three-valued probability assessment is (p, q, r), where p represents personal probability "for" the truth of an assertion, while q represents personal probability "against," and r represents personal probability of "don't know." Each of p, q, and r is non-negative and together they sum to unity. The outcome complementary to a given target associated with (p, q, r) has the associated personal probability triple (q, p, r). The "ordinary" calculus is recovered from the "extended" calculus by limiting (p, q, r) uncertainty assessments to the form (p, q, 0), or (p, q) for short. The "ordinary" calculus permits "you" to be sure that the assertion is true through (p, q, r) = (1, 0, 0), or false through (p, q, r) = (0, 1, 0), while the "extended" calculus additionally permits (p, q, r) = (0, 0, 1), representing total ignorance.

Devotees of the "ordinary" calculus are sometimes inclined, when confronted with the introduction of r > 0, to ask why the extra term is needed. Aren't probabilities (p, q) with p + q = 1 sufficient to characterize scientific and operational uncertainties? Who needs probabilities of "don't know"? One answer is that every application of a Bayesian model is necessarily based on a limited state space structure (SSS) that does not assess associated (p, q) probabilities for more inclusive state space structures. Such extended state space structures realistically always exist, and may be relevant to reported inferences. In effect, every Bayesian analysis makes implicit assumptions that evidence about true states of variables omitted from an SSS is "independent" of additional probabilistic knowledge, including ECP expressions thereof, that should accompany explicitly identifiable state spaces. DS methodology makes available a wide range of models and analyses whose differences from narrower analyses can point to "biases" due to the limitations of state spaces associated with reported Bayesian analyses. Failure of narrower assumptions often accentuates non-reproducibility of findings from non-DS statistical studies, casting doubts on the credibility of many statistical studies.

DS methodology can go part way at least to fixing the problem through broadening of state space structures and indicating plausible assignments of personal probabilities of "don't know" to aspects of broadened state spaces, including the use of (p, q, r) = (0, 0, 1) when no empirical basis "whatever," to quote Keynes, exists for the use of "a good Benthamite calculation of a series of prospective advantages and disadvantages, each multiplied by its appropriate probability, waiting to be summed" that can be brought to bear. DS allows a wide range of possible probabilistic uncertainty assessments between complete ignorance and the fully "Benthamite" (i.e., Bayesian) models that Keynes rejected for many applications.
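As a concrete illustration of the (p, q, r) notation, the following minimal Python sketch (my own, not from the chapter) encodes a three-valued assessment, its complement (q, p, r), the ordinary-calculus special case r = 0, and the lower and upper probabilities p and p + r that reappear later in the chapter; the class name and numerical values are purely illustrative.

```python
# Minimal sketch of three-valued (p, q, r) assessments for a single assertion.

from dataclasses import dataclass

@dataclass(frozen=True)
class PQR:
    p: float  # personal probability "for" the assertion
    q: float  # personal probability "against"
    r: float  # personal probability of "don't know"

    def __post_init__(self):
        assert min(self.p, self.q, self.r) >= 0
        assert abs(self.p + self.q + self.r - 1.0) < 1e-9

    def complement(self):
        # The complementary assertion carries the reordered triple (q, p, r).
        return PQR(self.q, self.p, self.r)

    def lower_upper(self):
        # Lower and upper probabilities of the assertion: p and p + r.
        return self.p, self.p + self.r

sure_true = PQR(1.0, 0.0, 0.0)   # certainty that the assertion holds
ignorance = PQR(0.0, 0.0, 1.0)   # total "don't know"
ordinary  = PQR(0.7, 0.3, 0.0)   # r = 0 recovers the ordinary (p, q) calculus
partial   = PQR(0.5, 0.2, 0.3)

print(partial.complement())      # PQR(p=0.2, q=0.5, r=0.3)
print(partial.lower_upper())     # (0.5, 0.8)
```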


As the world of statistical analysis moves more and more to "big data" and associated "complex systems," the DS middle ground can be expected to become increasingly important. DS puts no restraints on making state space structures as large as judged essential for bias protection, while the accompanying increases in many probabilities of "don't know" will often require paying serious attention to the introduction of more evidence, including future research studies. Contrary to the opinion of critics who decry all dependence on mathematical models, the need is for more inclusive and necessarily more complex mathematical models that will continue to come on line as associated information technologies advance.

24.4 The standard DS protocol

Developing and carrying out a DS analysis follows a prescribed sequence of activities and operations. First comes defining the state space structure, referred to henceforth by its acronym SSS. The purpose of initializing an SSS is to render precise the implied connection between the mathematical model and a piece of the actual real world. Shafer introduced the insightful term "frame of discernment" for what I am calling the SSS. The SSS is a mathematical set whose elements are the possible true values of some "small world" under investigation. The SSS is typically defined by a vector or multi-way array of variables, each with its own known or unknown true value. Such an SSS may be very simple, such as a vector of binary variables representing the outcomes of successive tosses of a bent coin, some observed, and some, such as future tosses, remaining unobserved. Or, an SSS may be huge, based on a set of variables representing multivariate variation across situations that repeat across times and spatial locations, in fields such as climatology, genomics, or economics.

The requirement of an initialized SSS follows naturally from the desirability of clearly and adequately specifying at the outset the extent of admissible queries about the true state of a small world under analysis. Each such query corresponds mathematically to a subset of the SSS. For example, before the first toss of a coin in an identified sequence of tosses, I might formulate a query about the outcome, and respond by assigning a (p, q, r) personal probability triple to the outcome "head," and its reordered (q, p, r) triple to the "complementary" outcome "tail." After the outcome "head" is observed and known to "you," the appropriate inference concerning the outcome "head" is (1, 0, 0), because the idealized "you" is sure about the outcome. The assertion "head on the first toss" is represented by the "marginal" subset of the SSS consisting of all possible outcomes of all the variables beyond the first toss, which has been fixed by observation. A DS inference (1, 0, 0) associated with "head" signifies observed data.


The mindset of the user of DS methods is that each assignment of a (p, q, r) judgment is based on evidence. Evidence is a term of art, not a formal concept. Statistical data is one source of evidence. If the outcome of a sequence of n = 10 tosses of a certain bent coin is observed to result in data HTTHHHTHTT, then each data point provides evidence about a particular toss, and queries with (1, 0, 0) responses can be given with confidence concerning individual margins of the 10-dimensional SSS. The user can "combine" these marginal inferences so as to respond to queries depending on subsets of the sequence, or about any interesting properties of the entire sequence.

The DS notion of "combining" sources of evidence extends to cover probabilistically represented sources of evidence that combine with each other and with data to produce fused posterior statistical inferences. This inference process can be illustrated by revising the simple SSS of 10 binary variables to include a long sequence of perhaps N = 10,000 coin tosses, of which the observed n = 10 tosses are only the beginning. Queries may now be directed at the much larger set of possible outcomes concerning properties of subsets, or about any or all of the tosses, whether observed or not. We may be interested primarily in the long run fraction P of heads in the full sequence, then shrink back the revised SSS to the sequence of variables X_1, ..., X_10, P, whence to work with approximate mathematics that treats N as infinite so that P may take any real value on the closed interval [0, 1]. The resulting inference situation was called "the fundamental problem of practical statistics" by Karl Pearson in 1920, who gave a Bayesian solution. It was the implicit motivation for Jakob Bernoulli writing circa 1700, leading him to introduce binomial sampling distributions. It was again the subject of Thomas Bayes's seminal posthumous 1763 note introducing what are now known as uniform Bayesian priors and associated posterior distributions for an unknown P.

My 1966 DS model and analysis for this most basic inference situation, when recast in 2013 terminology, is best explained by introducing a set of "auxiliary" variables U_1, ..., U_10 that are assigned a uniform personal probability distribution over the 10-dimensional unit cube. The U_i do not represent any real world quantities, but are simply technical crutches created for mathematical convenience that can be appropriately marginalized away in the end because inferences concerning the values of the U_i have no direct real world interpretation.

Each of the independent and identically distributed U_i provides the connection between a known X_i and the target unknown P. The relationships among X_i, P, and U_i are already familiar to statisticians because they are widely used to "computer-generate" a value of X_i for given P. Specifically, my suggested relations are

X_i = 1 if 0 ≤ U_i ≤ P, and X_i = 0 if P < U_i ≤ 1.
The same relations are to be used when the X_i are assumed known and inferences about P are sought. The justification for the less familiar "inverse" application is tied to the fundamental DS "rule of combination" under "independence" that was the pivotal innovation of my earliest papers. The attitude here is that the X_i and P describe features of the real world, with values that may be known or unknown according to context, thereby avoiding the criticism that P and X_i must be viewed asymmetrically because P is "fixed" and X_i is "random." Under a personalist viewpoint the argument based on asymmetry is not germane. Either or both of the two independence assumptions may be assumed according as P or X_i or both have known values. (In the case of "both," the U_i become partially known according to the above formulas.)

Precise details of the concepts and operations of the "extended calculus of probability" (ECP) arise naturally when the X_i are fixed, whence the above relations do not determine P uniquely, but instead limit P to an interval associated with personal probabilities determined by the U_i. Under the ECP in its general form, personal probabilities are constructed from a distribution over subsets of the SSS that we call a "mass distribution." The mass distribution determines by simple sums the (p, q, r) for any desired subset of the SSS, according as mass is restricted to the subset in the case of p, or is restricted to the complement of the subset in the case of q, or has positive accumulated mass in both subsets in the case of r. DS combination of independent component mass distributions involves both intersection of subsets as in propositional logic, and multiplication of probabilities as in the ordinary calculus of probability (OCP). The ECP allows not only projecting a mass distribution "down" to margins, as in the OCP, but also inverse projection of a marginal mass distribution "up" to a finer margin of the SSS, or to an SSS that has been expanded by adjoining arbitrarily many new variables. DS combination takes place in principle across input components of evidence that have been projected up to a full SSS, although in practice computational shortcuts are often available. Combined inferences are then computed by projecting down to obtain marginal inferences of practical interest.

Returning to the example called the "fundamental problem of practical statistics," it can be shown that the result of operating with the inputs of data-determined logical mass distributions, together with inputs of probabilistic mass distributions based on the U_i, leads to a posterior mass distribution that in effect places the unknown P on the interval between the Rth and (R + 1)st ordered values of the U_i, where R denotes the observed number of "heads" in the n observed trials. This probabilistic interval is the basis for significance testing, estimation and prediction.

To test the null hypothesis that P = .25, for example, the user computes the probability p that the probabilistic interval for P is either completely to the right or left of P = .25. The complementary 1 − p is the probability r that the interval covers P = .25, because there is zero probability that the interval shrinks to precisely P. Thus (p, 0, r) is the triple corresponding to the assertion that the null hypothesis fails, and r = 1 − p replaces the controversial "p-value" of contemporary applied statistics. The choice offered by such a DS significance test is not to either "accept" or "reject" the null hypothesis, but instead is either to "not reject" or "reject."
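The significance test just described can be checked numerically. The sketch below is my own Monte Carlo construction, not code from the chapter: it takes from the text only the characterization that, with R heads observed in n tosses, the posterior mass distribution places P on the interval between the Rth and (R + 1)st ordered values of n iid Uniform(0, 1) auxiliaries; the function name, simulation size, and seed are arbitrary choices.

```python
# Monte Carlo sketch of the DS test of a null value of P, based on the
# posterior interval [U_(R), U_(R+1)] described in the text.

import numpy as np

rng = np.random.default_rng(0)

def ds_test(n, R, p_null, n_sim=200_000):
    """Return the (p, 0, r) triple for the assertion that the null P = p_null fails."""
    U = np.sort(rng.uniform(size=(n_sim, n)), axis=1)
    # Pad with 0 and 1 so that R = 0 or R = n give endpoints 0 or 1.
    padded = np.hstack([np.zeros((n_sim, 1)), U, np.ones((n_sim, 1))])
    lo, hi = padded[:, R], padded[:, R + 1]       # the interval [U_(R), U_(R+1)]
    p = np.mean((hi < p_null) | (lo > p_null))    # interval lies entirely to one side
    r = 1.0 - p                                   # interval covers the null value
    return p, 0.0, r

# Data HTTHHHTHTT: n = 10 tosses, R = 5 heads; here r plays the role
# assigned in the text to the usual p-value.
print(ds_test(10, 5, 0.25))
```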


In a similar vein a (p, q, r) triple can be associated with any specified range of P values, such as the interval (.25, .75), thus creating an "interval estimate." Similarly, if so desired, a "sharp" null hypothesis such as P = .25 can be rendered "dull" using an interval such as (.24, .26). Finally, if the SSS is expanded to include a future toss or tosses, then (p, q, r) "predictions" concerning the future outcomes of such tosses can be computed given observed sample data.

There is no space here to set forth details concerning how independent input mass distributions on margins of an SSS are up-projected to mass distributions on the full SSS, and are combined there and used to derive inferences as in the preceding paragraphs. Most details have been in place, albeit using differing terminology, since the 1960s. The methods are remarkably simple and mathematically elegant. It is surprising to me that research on the standard protocol has not been taken up by any but an invisible sliver of the mathematical statistics community.

The inference system outlined in the preceding paragraphs can and should be straightforwardly developed to cover many or most inference situations found in statistical textbooks. The result will not only be that traditional Bayesian models and analyses can be re-expressed in DS terms, but more significantly that many "weakened" modifications of such inferences will become apparent, for example, by replacing Bayesian priors with DS mass distributions that demand less in terms of supporting evidence, including limiting "total ignorance" priors concerning "parameter" values. In the case of such (0, 0, 1)-based priors, traditional "likelihood functions" assume a restated DS form having a mass distribution implying stand-alone DS inferences. But when a "prior" includes limited probabilities of "don't know," the OCP "likelihood principle" no longer holds, nor is it needed. It also becomes easy in principle to "weaken" parametric forms adopted in likelihood functions, for example, by exploring DS analyses that do not assume precise normality, but might assume that cumulative distribution functions (CDFs) are within, say, .10 of a Gaussian CDF. Such "robustness" research is in its infancy, and is without financial support, to my knowledge, at the present time.

The concepts of DS "weakening," or conversely "strengthening," provide basic tools of model construction and revision for a user to consider in the course of arriving at final reports. In particular, claims about complex systems may be more appropriately represented in weakened forms with increased probabilities of "don't know."
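For readers who want to see the rule of combination mentioned in this section written out in its simplest case, the following sketch gives the standard Dempster rule specialized to a single binary assertion, as I understand it; the chapter itself does not spell out these formulas, so treat the algebra below (masses multiply, supported subsets intersect, and mass falling on the empty set is renormalized away) as an illustration under that assumption.

```python
# Sketch of DS combination of two independent (p, q, r) assessments of the
# same binary assertion: intersection of subsets plus multiplication of
# masses, with the conflicting mass renormalized away.

def ds_combine(a, b):
    p1, q1, r1 = a
    p2, q2, r2 = b
    conflict = p1 * q2 + q1 * p2              # "for" meets "against": empty intersection
    k = 1.0 - conflict                        # normalizing constant
    p = (p1 * p2 + p1 * r2 + r1 * p2) / k     # intersections supporting "for"
    q = (q1 * q2 + q1 * r2 + r1 * q2) / k     # intersections supporting "against"
    r = (r1 * r2) / k                         # both components say "don't know"
    return p, q, r

# Two independent pieces of evidence, each leaving some "don't know":
print(ds_combine((0.6, 0.1, 0.3), (0.5, 0.2, 0.3)))
# Combining with total ignorance (0, 0, 1) changes nothing:
print(ds_combine((0.6, 0.1, 0.3), (0.0, 0.0, 1.0)))
```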


24.5 Nonparametric inference

When I became a PhD student in the mid-50s, Sam Wilks suggested to me that the topic of "nonparametric" or "distribution-free" statistical inference had been largely worked through in the 1940s, in no small part through his efforts, implying that I might want to look elsewhere for a research topic. I conclude here by sketching how DS could introduce new thinking that goes back to the roots of this important topic.

A use of binomial sampling probabilities similar to that in my coin-tossing example arises in connection with sampling a univariate continuous observable. In a 1939 obituary for "Student" (W.S. Gosset), Fisher recalled that Gosset had somewhere remarked that given a random sample of size 2 with a continuous observable, the probability is 1/2 that the population median lies between the observations, with the remaining probabilities 1/4 and 1/4 evenly split between the two sides of the data. In a footnote, Fisher pointed out how Student's remark could be generalized to use binomial sampling probabilities to locate with computed probabilistic uncertainty any nominated population quantile in each of the n + 1 intervals determined by the data. In DS terms, the same ordered uniformly distributed auxiliaries used in connection with "binomial" sampling of a dichotomy extend easily to provide marginal mass distribution posteriors for any unknown quantile of the population distribution, not just the quantiles at the observed data points. When the DS analysis is extended to placing an arbitrary population quantile in intervals other than exactly determined by the observations, (p, q, r) inferences arise that in general have r > 0, including r = 1 for assertions concerning the population CDF in regions in the tails of the data beyond the largest and smallest observations. In addition, DS would have allowed Student to extend his analysis to predict that a third sample draw will lie in each of the three regions determined by the data with equal probabilities 1/3, or more generally with equal probabilities 1/(n + 1) in the n + 1 regions determined by n observations.
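The 1/(n + 1) prediction statement above is easy to verify by simulation. The short check below is my own construction (function name, distribution, and simulation size are arbitrary); it only relies on the claim in the text that a further draw falls in each of the n + 1 regions determined by n observations from a continuous distribution with equal probability.

```python
# Quick simulation of the equal 1/(n + 1) prediction probabilities for the
# n + 1 regions determined by n observations from a continuous distribution.

import numpy as np

rng = np.random.default_rng(1)

def region_frequencies(n, n_sim=100_000):
    # Any continuous distribution works; a skewed one emphasizes that the
    # result does not depend on the shape of the population.
    draws = rng.lognormal(size=(n_sim, n + 1))
    sample, new = draws[:, :n], draws[:, n]
    # Region index = number of sample points below the new draw (0, ..., n).
    idx = (sample < new[:, None]).sum(axis=1)
    return np.bincount(idx, minlength=n + 1) / n_sim

print(region_frequencies(2))   # roughly 1/3 in each of the 3 regions (Student's case)
print(region_frequencies(9))   # roughly 1/10 in each of the 10 regions
```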


The "nonparametric" analysis that Fisher pioneered in his 1939 footnote serves to illustrate DS-ECP logic in action. It can also serve to illustrate the need for critical examination of particular models and consequent analyses. Consider the situation of a statistician faced with analysis of a sample of modest size, such as n = 30, where a casual examination of the data suggests that the underlying population distribution has a smooth CDF but does not conform to an obvious simple parametric form such as Gaussian. After plotting the data, it would not be surprising to see that the lengths of intervals between successive ordered data points over the middle ranges of the data vary by a factor of two or three. The nonparametric model asserts that these intervals have equal probabilities 1/(n + 1) = 1/31 of containing the next draw, but a broker offering both sides of bets based on these probabilities would soon be losing money because takers would bet with higher and lower probabilities for longer and shorter intervals. The issue here is a version of the "multiplicity" problem of applied inferential statistics.

Several modified DS-ECP strategies come to mind. One idea is to "strengthen" the broad assumption that allows all continuous population CDFs, by imposing greater smoothness, for example by limiting consideration to population distributions with convex log probability density function. If the culprit is that invariance over all monotone continuous transforms of the scale of the observed variable is too much, then maybe back off to just linear invariance as implied by a convexity assumption. Alternatively, if it is desired to retain invariance over all monotone transforms, then the auxiliaries can be "weakened" by requiring "weakened" auxiliary "don't know" terms to apply across ranges of intervals between data points. The result would be bets with increased "don't know" that could help protect the broker against bankruptcy.

24.6 Open areas for research

Many opportunities exist for modifying state space structures through alternative choices that delete some variables and add others. The robustness of simpler models can be studied when only weak or nonexistent personal probability restrictions can be supported by evidence concerning the effects of additional model complexity. Many DS models that mimic standard multiparameter models can be subjected to strengthening or weakening modifications. In particular, DS methods are easily extended to discount for cherry-picking among multiple inferences. There is no space here to survey a broad range of stochastic systems and related DS models that can be re-expressed and modified in DS terms.

DS versions of decision theory merit systematic development and study. Decision analysis assumes a menu of possible actions, each of which is associated with a real-valued utility function defined over the state space structure (SSS). Given an evidence-based posterior mass distribution over the SSS, each possible action has an associated lower expectation and upper expectation defined in an obvious way. The lower expectation is interpreted as "your" guaranteed expected returns from choosing alternative actions, so is a reasonable criterion for "optimal" decision-making.

In the case of simple bets, two or more players compete for a defined prize with their own DS mass functions and with the same utility function on the same SSS. Here, Borel's celebrated observation applies:

"It has been said that, given two bettors, there is always one thief and one imbecile; that is true in certain cases when one of the two bettors is much better informed than the other, and knows it; but it can easily happen that two men of good faith, in complex matters where they possess exactly the same elements of information, arrive at different conclusions on the probabilities of an event, and that in betting together, each believes... that it is he who is the thief and the other the imbecile." (Borel, 1924)
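To make the lower and upper expectations mentioned above concrete, here is a small sketch of my own for a finite SSS. It assumes the usual convention for expectations with respect to a mass distribution over focal subsets (each subset contributes its mass times the worst-case utility for the lower expectation, and the best-case utility for the upper); the states, masses, and utilities are invented purely for illustration.

```python
# Sketch of lower and upper expected utility of an action, given a posterior
# mass distribution over subsets ("focal sets") of a small SSS.
# Assumed convention: each focal set contributes mass * min (lower) or
# mass * max (upper) of the utility over that set.

def lower_upper_expectation(mass, utility):
    """mass: dict frozenset-of-states -> mass; utility: dict state -> utility."""
    lower = sum(m * min(utility[s] for s in A) for A, m in mass.items())
    upper = sum(m * max(utility[s] for s in A) for A, m in mass.items())
    return lower, upper

states = {"up", "flat", "down"}
utility_invest = {"up": 100.0, "flat": 0.0, "down": -80.0}   # hypothetical action
mass = {
    frozenset({"up"}): 0.4,            # evidence pointing to "up"
    frozenset({"up", "flat"}): 0.3,    # evidence that cannot separate "up" from "flat"
    frozenset(states): 0.3,            # residual "don't know"
}

print(lower_upper_expectation(mass, utility_invest))   # (16.0, 100.0)
```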


A typical modern application involves choices among different investment opportunities, through comparisons of DS posterior lower expectations of future monetary gains for buyers or losses for sellers, and for risk-taking brokers who quote bid/ask spreads while bringing buyers and sellers together. For mathematical statisticians, the field of DS decision analysis is wide open for investigations of models and analyses, of interest both mathematically and for practical use. For the latter in particular there are many potentially useful models to be defined and studied, and to be implemented numerically by software with acceptable speed, accuracy, and cost.

A third area of potential DS topics for research statisticians concerns modeling and inference for "parametric" models, as originally formulated by R.A. Fisher in his celebrated pair of 1920s papers on estimation. The concept of a statistical parameter is plagued by ambiguity. I believe that the term arose in parallel with similar usage in physics. For example, the dynamics of physical systems are often closely approximated by classical Newtonian physical laws, but application of the laws can depend on certain "parameters" whose actual numerical values are left "to be determined." In contemporary branches of probabilistic statistical sciences, stochastic models are generally described in terms of parameters similarly left "to be determined" prior to direct application. The mathematics is clear, but the nature of the task of parameter determination for practical application is murky, and in statistics is a chief source of contention between frequentists and Bayesians.

In many stochastic sampling models, including many pioneered by Fisher, parameters such as population fractions, means, and standard deviations can actually represent specific unknown real world population quantities. Oftentimes, however, parameters are simply ad hoc quantities constructed on the fly while "fitting" mathematical forms to data. To emphasize the distinction, I like to denote parameters of the former type by Roman capital letters such as P, M, and S, while denoting analogous "parameters" fitted to data by Greek letters π, µ, and σ. The distinction here is important, because the parameters of personalist statistical science draw on evidence and utilities that can only be assessed one application at a time. "Evidence-based" assumptions, in particular, draw upon many types of information and experience.

Very little published research exists that is devoted to usable personalist methodology and computational software along the lines of the DS standard protocol. Even the specific DS sampling model for inference about a population with k exchangeable categories that was proposed in my initial 1966 paper in The Annals of Mathematical Statistics has not yet been implemented and analyzed beyond the trivially simple case of k = 2. I published a brief report in 2008 on estimates and tests for a Poisson parameter L, which is a limiting case of the binomial P.
I have given eight or so seminars and lectures in recent years, more in Europe than in North America, and largely outside of the field of statistics proper. I have conducted useful correspondences with statisticians, most notably Paul Edlefsen, Chuanhai Liu, and Jonty Rougier, for which I am grateful.

Near the end of my 1966 paper I sketched an argument to the effect that a limiting case of my basic multinomial model for data with k exchangeable cells, wherein the observables become continuous in the limit, leads to defined limiting DS inferences concerning the parameters of widely adopted parametric sampling models. This theory deserves to pass from conjecture to theorem, because it bears on how specific DS methods based on sampling models generalize traditional methods based on likelihood functions that have been studied by mathematical statisticians for most of the 20th century. Whereas individual likelihood functions from each of the n data points in a random sample multiply to yield the combined likelihood function for a complete random sample of size n, the generalizing DS mass distributions from single observations combine under the DS combination rule to provide the mass distribution for DS inference under the full sample. In fact, values of likelihood functions are seen to be identical to "upper probabilities" p + r obtained from DS (p, q, r) inferences for singleton subsets of the parameters. Ordinary likelihood functions are thus seen to provide only part of the information in the data. What they lack is information associated with the "don't know" feature of the extended calculus of probability.

Detailed connections between the DS system and its predecessors invite illuminating research studies. For example, mass distributions that generalize likelihood functions from random samples combine under the DS rule of combination with prior mass distributions to provide DS inferences that generalize traditional Bayesian inferences. The use of the term "generalize" refers specifically to the recognition that when the DS prior specializes to an ordinary calculus of probability (OCP) prior, the DS combination rule reproduces the traditional "posterior = prior × likelihood" rule of traditional Bayesian inference. In an important sense, the DS rule of combination is thus seen to generalize the OCP axiom that the joint probability of two events equals the marginal probability of one multiplied by the conditional probability of the other given the first.

I believe that an elegant theory of parametric inference is out there just waiting to be explained in precise mathematical terms, with the potential to render moot many of the confusions and controversies of the 20th century over statistical inference.


References

Borel, É. (1924). À propos d'un traité de probabilités. Revue philosophique, 98. English translation: Apropos of a treatise on probability. In Studies in Subjective Probability (H.E. Kyburg Jr. and H.E. Smokler, Eds.). Wiley, New York (1964).

Borel, É. (1939). Valeur pratique et philosophie des probabilités. Gauthier-Villars, Paris.

Dempster, A.P. (1966). New methods for reasoning towards posterior distributions based on sample data. The Annals of Mathematical Statistics, 37:355–374.

Dempster, A.P. (1967). Upper and lower probabilities induced by a multivalued mapping. The Annals of Mathematical Statistics, 38:325–339.

Dempster, A.P. (1968). A generalization of Bayesian inference (with discussion). Journal of the Royal Statistical Society, Series B, 30:205–247.

Dempster, A.P. (1990). Bayes, Fisher, and belief functions. In Bayesian and Likelihood Methods in Statistics and Econometrics: Essays in Honor of George Barnard (S. Geisser, J.S. Hodges, S.J. Press, and A. Zellner, Eds.). North-Holland, Amsterdam.

Dempster, A.P. (1990). Construction and local computation aspects of networked belief functions. In Influence Diagrams, Belief Nets and Decision Analysis (R.M. Oliver and J.Q. Smith, Eds.). Wiley, New York.

Dempster, A.P. (1998). Logicist statistics I. Models and modeling. Statistical Science, 13:248–276.

Dempster, A.P. (2008). Logicist statistics II. Inference. (Revised version of the 1998 COPSS Fisher Memorial Lecture.) In Classic Works of the Dempster–Shafer Theory of Belief Functions (L. Liu and R.R. Yager, Eds.).

Dempster, A.P. (2008). The Dempster–Shafer calculus for statisticians. International Journal of Approximate Reasoning, 48:365–377.

Dempster, A.P. and Chiu, W.F. (2006). Dempster–Shafer models for object recognition and classification. International Journal of Intelligent Systems, 21:283–297.

Fisher, R.A. (1922). On the mathematical foundations of theoretical statistics. Philosophical Transactions, Series A, 222:309–368.

Fisher, R.A. (1925). Theory of statistical estimation. Proceedings of the Cambridge Philosophical Society, 22:200–225.


Fisher, R.A. (1939). "Student." Annals of Eugenics, 9:1–9.

Keynes, J.M. (1937). The general theory of employment. Quarterly Journal of Economics, 51:209–223.

Pearson, K. (1920). The fundamental problem of practical statistics. Biometrika, 13:1–16.

Savage, L.J. (1950). The role of personal probability in statistics. Econometrica, 18:183–184.

Savage, L.J. (1981). The Writings of Leonard Jimmie Savage — A Memorial Selection. American Statistical Association and Institute of Mathematical Statistics, Washington, DC.

Shafer, G. (1976). A Mathematical Theory of Evidence. Princeton University Press, Princeton, NJ.

Wilks, S.S. (1948). Order statistics. Bulletin of the American Mathematical Society, 54:6–48.


25
Nonparametric Bayes

David B. Dunson
Department of Statistical Science
Duke University, Durham, NC

I reflect on the past, present, and future of nonparametric Bayesian statistics. Current nonparametric Bayes research tends to be split between theoretical studies, seeking to understand relatively simple models, and machine learning, defining new models and computational algorithms motivated by practical performance. I comment on the current landscape, open problems and promising future directions in modern big data applications.

25.1 Introduction

25.1.1 Problems with parametric Bayes

In parametric Bayesian statistics, one chooses a likelihood function L(y | θ) for data y, which is parameterized in terms of a finite-dimensional unknown θ. Choosing a prior distribution for θ, one updates this prior with the likelihood L(y | θ) via Bayes' rule to obtain the posterior distribution π(θ | y) for θ. This framework has a number of highly appealing characteristics, ranging from flexibility to the ability to characterize uncertainty in θ in an intuitively appealing probabilistic manner. However, one unappealing aspect is the intrinsic assumption that the data were generated from a particular probability distribution (e.g., a Gaussian linear model).

There are a number of challenging questions that arise in considering, from both philosophical and practical perspectives, what happens when such an assumption is violated, as is arguably always the case in practice. From a philosophical viewpoint, if one takes a parametric Bayesian perspective, then a prior is being assumed that has support on a measure zero subset of the set of possible distributions that could have generated the data. Of course, as it is commonly accepted that all models are wrong, it seems that such a prior does not actually characterize any individual's prior beliefs, and one may question the meaning of the resulting posterior from a subjective Bayes perspective.
It would seem that a rational subjectivist would assign positive prior probability to the case in which the presumed parametric model is wrong in unanticipated ways, and probability zero to the case in which the data are generated exactly from the presumed model. Objective Bayesians should similarly acknowledge that any parametric model is wrong, or at least has a positive probability of being wrong, in order to truly be objective. It seems odd to spend an enormous amount of effort showing that a particular prior satisfies various objectivity properties in a simple parametric model, as has been the focus of much of the Bayes literature.

The failure to define a framework for choosing priors in parametric models, which acknowledges that the "working" model is wrong, leads to some clear practical issues with parametric Bayesian inference. One of the major ones is the lack of a framework for model criticism and goodness-of-fit assessments. Parametric Bayesians assume prior knowledge of the true model which generated the data, and hence there is no allowance within the Bayesian framework for incorrect model choice. For this reason, the literature on Bayesian goodness-of-fit assessments remains under-developed, with most of the existing approaches relying on diagnostics that lack a Bayesian justification. A partial solution is to place a prior distribution over a list of possible models instead of assuming a single model is true a priori. However, such Bayesian model averaging/selection approaches assume that the true model is one of those in the list, the so-called M-closed viewpoint, and hence do not solve the fundamental problem.

An alternative pragmatic view is that it is often reasonable to operate under the working assumption that the presumed model is true. Certainly, parametric Bayesian and frequentist inferences often produce excellent results even when the true model deviates from the assumptions. In parametric Bayesian models, it tends to be the case that the posterior distribution for the unknown θ will concentrate at the value θ_0, which yields a sampling distribution that is as close as possible to the true data-generating model in terms of the Kullback–Leibler (KL) divergence. As long as the parametric model provides an "adequate" approximation, and this divergence is small, it is commonly believed that inferences will be "reliable." However, there has been some research suggesting that this common belief is often wrong, such as when the loss function is far from KL (Owhadi et al., 2013).

Results of this type have provided motivation for "quasi" Bayesian approaches, which replace the likelihood with other functions (Chernozhukov and Hong, 2003). For example, quantile-based substitution likelihoods have been proposed, which avoid specifying the density of the data between quantiles (Dunson and Taylor, 2005). Alternatively, motivated by avoiding specification of parametric marginal distributions in considering copula dependence models, one can use an extended rank-based likelihood (Genest and Favre, 2007; Hoff, 2007; Genest and Nešlehová, 2012; Murray et al., 2013). Recently, the idea of a Gibbs posterior (Jiang and Tanner, 2008; Chen et al., 2010) was introduced, providing a generalization of Bayesian inference using a loss-based pseudo-likelihood.
Appealing properties of this approach have been shown in various contexts, but it is still unclear whether such methods are appropriately calibrated so that the quasi posterior distributions obtained provide a valid measure of uncertainty. It may be the case that uncertainty intervals are systematically too wide or too narrow, with asymptotic properties such as consistency providing no reassurance that uncertainty is well characterized. Fully Bayesian nonparametric methods require a full characterization of the likelihood, relying on models with infinitely-many parameters having carefully chosen priors that yield desirable properties. In the remainder of this chapter, I focus on such approaches.

25.1.2 What is nonparametric Bayes?

Nonparametric (NP) Bayes seeks to solve the above problems by choosing a highly flexible prior, which assigns positive probability to arbitrarily small neighborhoods around any true data-generating model f_0 in a large class. For example, as an illustration, consider the simple case in which y_1, ..., y_n form a random sample from density f. A parametric Bayes approach would parameterize the density f in terms of finitely-many unknowns θ, and induce a prior for f through a prior for θ. Such a prior will in general have support on a vanishingly small subset of the set of possible densities F (e.g., with respect to Lebesgue measure on R). NP Bayes instead lets f ∼ Π, with Π a prior over F having large support, meaning that Π{f : d(f, f_0) < ɛ} > 0 for some distance metric d, any ɛ > 0, and any f_0 in a large subset of F. Large support is the defining property of an NP Bayes approach, and means that realizations from the prior have a positive probability of being arbitrarily close to any f_0, perhaps ruling out some irregular ones (say with heavy tails).

In general, to satisfy the large support property, NP Bayes probability models include infinitely-many parameters and involve specifying stochastic processes for random functions. For example, in the density estimation example, a very popular prior is a Dirichlet process mixture (DPM) of Gaussians (Lo, 1984). Under the stick-breaking representation of the Dirichlet process (Sethuraman, 1994), such a prior lets

f(y) = Σ_{h=1}^∞ π_h N(y; µ_h, τ_h^{-1}),   (µ_h, τ_h) ~ iid P_0,   (25.1)

where the weights on the normal kernels follow a stick-breaking process, π_h = V_h ∏_{l<h} (1 − V_l), with V_h ~ Beta(1, α) drawn independently.
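To make the construction in (25.1) tangible, here is a small simulation sketch of my own. It draws from a truncated version of the prior: the truncation level H, the particular base measure P_0 (independent normal means and gamma precisions), and the seed are illustrative choices of mine, and only the stick-breaking recipe itself comes from the equation above.

```python
# Sketch of simulating one draw from a truncated version of the DPM prior (25.1):
# stick-breaking weights pi_h = V_h * prod_{l<h} (1 - V_l) with V_h ~ Beta(1, alpha),
# and Gaussian kernel parameters drawn iid from an illustrative base measure P_0.

import numpy as np

rng = np.random.default_rng(7)

def draw_dpm(alpha=1.0, H=50, n=500):
    V = rng.beta(1.0, alpha, size=H)
    V[-1] = 1.0                                   # close off the truncated stick
    pi = V * np.concatenate([[1.0], np.cumprod(1.0 - V[:-1])])
    mu = rng.normal(0.0, 3.0, size=H)             # illustrative base measure P_0
    tau = rng.gamma(2.0, 1.0, size=H)
    # Sample n observations from the realized mixture density f.
    z = rng.choice(H, size=n, p=pi)               # component labels
    y = rng.normal(mu[z], 1.0 / np.sqrt(tau[z]))
    return pi, y

pi, y = draw_dpm()
print(np.sort(pi)[::-1][:5])   # a few kernels carry most of the weight
```

The last line mirrors the point made next in the text: although there are infinitely many components in principle, a realization concentrates nearly all of its weight on a handful of them.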


Prior (25.1) is intuitively appealing in including infinitely-many Gaussian kernels having stochastically decreasing weights. In practice, there will tend to be a small number of kernels having large weights, with the remaining having vanishingly small weights. Only a modest number of kernels will be occupied by the subjects in a sample, so that the effective number of parameters may actually be quite small and is certainly not infinite, making posterior computation and inferences tractable. Of course, prior (25.1) is only one particularly simple example (to quote Andrew Gelman, "No one is really interested in density estimation"), and there has been an explosion of literature in recent years proposing an amazing variety of NP Bayes models for broad applications and data structures.

Section 25.2 contains an (absurdly) incomplete timeline of the history of NP Bayes up through the present. Section 25.3 comments briefly on interesting future directions.

25.2 A brief history of NP Bayes

Although there were many important earlier developments, the modern view of nonparametric Bayes statistics was essentially introduced in the papers of Ferguson (1973, 1974), which proposed the Dirichlet process (DP) along with several ideal criteria for a nonparametric Bayes approach including large support, interpretability and computational tractability. The DP provides a prior for a discrete probability measure with infinitely many atoms, and is broadly employed within Bayesian models as a prior for mixing distributions and for clustering. An equally popular prior is the Gaussian process (GP), which is instead used for random functions or surfaces. A non-negligible proportion of the nonparametric Bayes literature continues to focus on theoretical properties, computational algorithms and applications of DPs and GPs in various contexts.

In the 1970s and 1980s, NP Bayes research was primarily theoretical and conducted by a narrow community, with applications focused primarily on jointly conjugate priors, such as simple cases of the gamma process, DP and GP. Most research did not consider applications or data analysis at all, but instead delved into characterizations and probabilistic properties of stochastic processes, which could be employed as priors in NP Bayes models. These developments later had substantial applied implications in facilitating computation and the development of richer model classes.

With the rise in computing power, development of Gibbs sampling and explosion in use of Markov chain Monte Carlo (MCMC) algorithms in the early 1990s, nonparametric Bayes methods started to become computationally tractable. By the late 1990s and early 2000s, there were a rich variety of inferential algorithms available for general DP mixtures and GP-based models in spatial statistics, computer experiments and beyond.
These algorithms, combined with increasing knowledge of theoretical properties and characterizations, stimulated an explosion of modeling innovation starting in the early 2000s but really gaining steam by 2005. A key catalyst in this exponential growth of research activity and innovation in NP Bayes was the dependent Dirichlet process (DDP) of Steve MacEachern, which ironically was never published and is only available as a technical report. The DDP and other key modeling innovations were made possible by earlier theoretical work providing characterizations, such as stick-breaking (Sethuraman, 1994; Ishwaran and James, 2001) and the Pólya urn scheme/Chinese restaurant process (Blackwell and MacQueen, 1973). Some of the circa 2005–10 innovations include the Indian buffet process (IBP) (Griffiths and Ghahramani, 2011), the hierarchical Dirichlet process (HDP) (Teh et al., 2006), the nested Dirichlet process (Rodríguez et al., 2008), and the kernel stick-breaking process (Dunson and Park, 2008).

One of the most exciting aspects of these new modeling innovations was the potential for major applied impact. I was fortunate to start working on NP Bayes just as this exponential growth started to take off. In the NP Bayes statistics community, this era of applied-driven modeling innovation peaked at the 2007 NP Bayes workshop at the Isaac Newton Institute at Cambridge University. The Newton Institute is an outstanding facility and there was an energy and excited vibe permeating the workshop, with a wide variety of topics being covered, ranging from innovative modeling driven by biostatistical applications to theoretical advances on properties. One of the most exciting aspects of statistical research is the ability to fully engage in a significant applied problem, developing methods that really make a practical difference in inferences or predictions in the motivating application, as well as in other related applications. To me, it is ideal to start with an applied motivation, such as an important aspect of the data that is not captured by existing statistical approaches, and then attempt to build new models and computational algorithms that have theoretical support and make a positive difference to the bottom-line answers in the analysis. The flexibility of NP Bayes models makes this toolbox ideal for attacking challenging applied problems.

Although the expansion of the NP Bayes community and impact of the research has continued since the 2007 Newton workshop, the trajectory and flavor of the work has shifted substantially in recent years. This shift is due in part to the emergence of big data and to some important cultural hurdles, which have slowed the expansion of NP Bayes in statistics and scientific applications, while stimulating increasing growth in machine learning. Culturally, statisticians tend to be highly conservative, having a healthy skepticism of new approaches even if they seemingly improve practical performance in prediction and simulation studies. Many statisticians will not really trust an approach that lacks asymptotic justification, and there is a strong preference for simple methods that can be studied and understood more easily. This is perhaps one reason for the enormous statistical literature on minor variations of the lasso.


NP Bayes methods require more of a learning curve. Most graduate programs in statistics have perhaps one elective course on Bayesian statistics, and NP Bayes is not a simple conceptual modification of parametric Bayes. Often models are specified in terms of infinite-dimensional random probability measures and stochastic processes. On the surface, this seems daunting and the knee-jerk reaction by many statisticians is negative, mentioning unnecessary complexity, concerns about over-fitting, whether the data can really support such complexity, lack of interpretability, and limited understanding of theoretical properties such as asymptotic behavior. This reaction restricts entry into the field and makes it more difficult to get publications and grant funding.

However, these concerns are largely unfounded. In general, the perceived complexity of NP Bayes models is due to lack of familiarity. Canonical model classes, such as DPs and GPs, are really quite simple in their structure and tend to be no more difficult to implement than flexible parametric models. The intrinsic Bayesian penalty for model complexity tends to protect against over-fitting. For example, consider the DPM of Gaussians for density estimation shown in equation (25.1). The model is simple in structure, being a discrete mixture of normals, but the perceived complexity comes in through the incorporation of infinitely many components. For statisticians unfamiliar with the intricacies of such models, natural questions arise such as "how can the data inform about all these parameters" and "there certainly must be over-fitting and huge prior sensitivity." However, in practice, the prior and the penalty that comes in through integrating over the prior in deriving the marginal likelihood tends to lead to allocation of all the individuals in the sample to relatively few clusters. Hence, even though there are infinitely many components, only a few of these are used and the model behaves like a finite mixture of Gaussians, with sieve behavior in terms of using more components as the sample size increases. Contrary to the concern about over-fitting, the tendency is instead to place a high posterior weight on very few components, potentially under-fitting in small sample sizes. DPMs are a simple example but the above story applies much more broadly.

The lack of understanding in the broad statistical community of the behavior of NP Bayes procedures tempered some of the enthusiastic applications-driven modeling of the 2000s, motivating an emerging field focused on studying frequentist asymptotic properties. There is a long history of NP Bayes asymptotics, showing properties such as consistency and rates of concentration of the posterior around the true unknown distribution or function. In the past five years, this field has really taken off and there is now a rich literature showing strong properties ranging from minimax optimal adaptive rates of posterior concentration (Bhattacharya et al., 2013) to Bernstein–von Mises results characterizing the asymptotic distribution of functionals (Rivoirard and Rousseau, 2012). Such theorems can be used to justify many NP Bayes methods as also providing an optimal frequentist procedure, while allowing frequentist statisticians to exploit computational methods and probabilistic interpretations of Bayes methods.
In addition, an appealing advantage of NP Bayes methods is the allowance for uncertainty in tuning parameter choice through hyperpriors, bypassing the need for cross-validation. The 2013 NP Bayes conference in Amsterdam was notable in exhibiting a dramatic shift in topics compared with the 2007 Newton conference, away from applications-driven modeling and towards asymptotics.

The other thread that was very well represented in Amsterdam was NP Bayes machine learning, which has expanded into a dynamic and important area. The machine learning (ML) community is fundamentally different culturally from statistics, and has had a very different response to NP Bayes methods as a result. In particular, ML tends to be motivated by applications in which bottom-line performance in metrics, such as out-of-sample prediction, takes center stage. In addition, the ML community prefers peer-reviewed proceedings for conferences, such as Neural Information Processing Systems (NIPS) and the International Conference on Machine Learning (ICML), over journal publications. These conference proceedings are short papers, and there is an emphasis on innovative new ideas which improve bottom-line performance. ML researchers tend to be aggressive and do not shy away from new approaches which can improve performance regardless of complexity. A substantial proportion of the novelty in NP Bayes modeling and computation has come out of the ML community in recent years. With the increased emphasis on big data across fields, the lines between ML and statistics have been blurring. However, publishing an initial idea in NIPS or ICML is completely different than publishing a well-developed and carefully thought out methods paper in a leading statistical theory and methods journal, such as the Journal of the American Statistical Association, Biometrika or the Journal of the Royal Statistical Society, Series B. My own research has greatly benefited by straddling the asymptotic, ML and applications-driven modeling threads, attempting to develop practically useful and innovative new NP Bayes statistical methods having strong asymptotic properties.

25.3 Gazing into the future

Moving into the future, NP Bayes methods have rich promise in terms of providing a framework for attacking a very broad class of "modern" problems involving high-dimensional and complex data. In big complex data settings, it is much more challenging to do model checking and to carefully go through the traditional process of assessing the adequacy of a parametric model, making revisions to the model as appropriate. In addition, when the number of variables is really large, it becomes unlikely that a particular parametric model works well for all these variables.
This is one of the reasons that ensemble approaches, which average across many models/algorithms, tend to produce state of the art performance in difficult prediction tasks. Combining many simple models, each able to express different characteristics of the data, is useful and similar conceptually to the idea of Bayesian model averaging (BMA), though BMA is typically only implemented within a narrow parametric class (e.g., normal linear regression).

In considering applications of NP Bayes in big data settings, several questions arise. The first is "Why bother?" In particular, what do we have to gain over the rich plethora of machine learning algorithms already available, and which are being refined and innovated upon daily by thousands of researchers? There are clear and compelling answers to this question. ML algorithms almost always rely on convex optimization to obtain a point estimate, and uncertainty is seldom of much interest in the ML community, given the types of applications they are faced with. In contrast, in most scientific applications, prediction is not the primary interest and one is usually focused on inferences that account for uncertainty. For example, the focus may be on assessing the conditional independence structure (graphical model) relating genetic variants, environmental exposures and cardiovascular disease outcomes (an application I'm currently working on). Obtaining a single estimate of the graph is clearly not sufficient, and would be essentially uninterpretable. Indeed, such graphs produced by ML methods such as graphical lasso have been deemed "ridiculograms." They critically depend on a tuning parameter that is difficult to choose objectively and produce a massive number of connections that cannot be effectively examined visually. Using an NP Bayes approach, we could instead make highly useful statements (at least according to my collaborators), such as (i) the posterior probability that genetic variants in a particular gene are associated with cardiovascular disease risk, adjusting for other factors, is P%; or (ii) the posterior probability that air pollution exposure contributes to risk, adjusted for genetic variants and other factors, is Q%. We can also obtain posterior probabilities of an edge between each pair of variables without parametric assumptions, such as Gaussianity. This is just one example of the utility of probabilistic NP Bayes models; I could list dozens of others.

The question then is why aren't more people using and working on the development of NP Bayes methods? The answer to the first part of this question is clearly computational speed, simplicity and accessibility. As mentioned above, there is somewhat of a learning curve involved in NP Bayes, which is not covered in most graduate curriculums. In contrast, penalized optimization methods, such as the lasso, are both simple and very widely taught. In addition, convex optimization algorithms for very rapidly implementing penalized optimization, especially in big data settings, have been highly optimized and refined in countless publications by leading researchers. This has led to simple methods that are scalable to big data, and which can exploit distributed computing architectures to further scale up to enormous settings. Researchers working on these types of methods often have a computer science or engineering background, and in the applications they face, speed is everything and characterizing uncertainty in inference or testing is just not a problem they encounter.
In fact, ML researchers working on NP Bayes methods seldom report inferences or use uncertainty in their analyses; they instead use NP Bayes methods combined with approximations, such as variational Bayes or expectation propagation, to improve performance on ML tasks, such as prediction. Often predictive performance can be improved, while avoiding cross-validation for tuning parameter selection, and these gains have partly led to the relative popularity of NP Bayes in machine learning.

It is amazing to me how many fascinating and important unsolved problems remain in NP Bayes, with the solutions having the potential to substantially impact practice in analyzing and interpreting data in many fields. For example, there is no work on the above nonparametric Bayes graphical modeling problem, though we have developed an initial approach we will submit for publication soon. There is very limited work on fast and scalable approximations to the posterior distribution in Bayesian nonparametric models. Markov chain Monte Carlo (MCMC) algorithms are still routinely used despite their problems with scalability due to the lack of decent alternatives. Variational Bayes and expectation propagation algorithms developed in ML lack theoretical guarantees and often perform poorly, particularly when the focus goes beyond obtaining a point estimate for prediction. Sequential Monte Carlo (SMC) algorithms face similar scalability problems to MCMC, with a daunting number of particles needed to obtain adequate approximations for high-dimensional models. There is a clear need for new models for flexible dimensionality reduction in broad settings. There is a clear lack of approaches for complex non-Euclidean data structures, such as shapes, trees, networks and other object data.

I hope that this chapter inspires at least a few young researchers to focus on improving the state of the art in NP Bayes statistics. The most effective path to success and high impact in my view is to focus on challenging real-world applications in which current methods have obvious inadequacies. Define innovative probability models for these data, develop new scalable approximations and computational algorithms, study the theoretical properties, implement the methods on real data, and provide software packages for routine use. Given how few people are working in such areas, there is much low-hanging fruit and the clear possibility of major breakthroughs, which are harder to achieve when jumping on bandwagons.

References

Bhattacharya, A., Pati, D., and Dunson, D.B. (2013). Anisotropic function estimation using multi-bandwidth Gaussian processes. The Annals of Statistics, in press.


Blackwell, D. and MacQueen, J. (1973). Ferguson distributions via Pólya urn schemes. The Annals of Statistics, 1:353–355.

Chen, K., Jiang, W., and Tanner, M. (2010). A note on some algorithms for the Gibbs posterior. Statistics & Probability Letters, 80:1234–1241.

Chernozhukov, V. and Hong, H. (2003). An MCMC approach to classical estimation. Journal of Econometrics, 115:293–346.

de Jonge, R. and van Zanten, J. (2010). Adaptive nonparametric Bayesian inference using location-scale mixture priors. The Annals of Statistics, 38:3300–3320.

Dunson, D.B. and Park, J.-H. (2008). Kernel stick-breaking processes. Biometrika, 95:307–323.

Dunson, D.B. and Taylor, J. (2005). Approximate Bayesian inference for quantiles. Journal of Nonparametric Statistics, 17:385–400.

Ferguson, T.S. (1973). Bayesian analysis of some nonparametric problems. The Annals of Statistics, 1:209–230.

Ferguson, T.S. (1974). Prior distributions on spaces of probability measures. The Annals of Statistics, 2:615–629.

Genest, C. and Favre, A.-C. (2007). Everything you always wanted to know about copula modeling but were afraid to ask. Journal of Hydrologic Engineering, 12:347–368.

Genest, C. and Nešlehová, J. (2012). Copulas and copula models. In Encyclopedia of Environmetrics, 2nd edition. Wiley, Chichester, 2:541–553.

Griffiths, T. and Ghahramani, Z. (2011). The Indian buffet process: An introduction and review. Journal of Machine Learning Research, 12:1185–1224.

Hoff, P. (2007). Extending the rank likelihood for semiparametric copula estimation. The Annals of Applied Statistics, 1:265–283.

Ishwaran, H. and James, L. (2001). Gibbs sampling methods for stick-breaking priors. Journal of the American Statistical Association, 96:161–173.

Jiang, W. and Tanner, M. (2008). Gibbs posterior for variable selection in high-dimensional classification and data mining. The Annals of Statistics, 36:2207–2231.

Lo, A. (1984). On a class of Bayesian nonparametric estimates: I. Density estimates. The Annals of Statistics, 12:351–357.


Murray, J.S., Dunson, D.B., Carin, L., and Lucas, J.E. (2013). Bayesian Gaussian copula factor models for mixed data. Journal of the American Statistical Association, 108:656–665.

Owhadi, H., Scovel, C., and Sullivan, T. (2013). Bayesian brittleness: Why no Bayesian model is "good enough." arXiv:1304.6772.

Rivoirard, V. and Rousseau, J. (2012). Bernstein–von Mises theorem for linear functionals of the density. The Annals of Statistics, 40:1489–1523.

Rodríguez, A., Dunson, D.B., and Gelfand, A. (2008). The nested Dirichlet process. Journal of the American Statistical Association, 103:1131–1144.

Sethuraman, J. (1994). A constructive definition of Dirichlet priors. Statistica Sinica, 4:639–650.

Teh, Y., Jordan, M., Beal, M., and Blei, D. (2006). Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101:1566–1581.


26
How do we choose our default methods?

Andrew Gelman
Department of Statistics
Columbia University, New York

The field of statistics continues to be divided into competing schools of thought. In theory one might imagine choosing the uniquely best method for each problem as it arises, but in practice we choose for ourselves (and recommend to others) default principles, models, and methods to be used in a wide variety of settings. This chapter briefly considers the informal criteria we use to decide what methods to use and what principles to apply in statistics problems.

26.1 Statistics: The science of defaults

Applied statistics is sometimes concerned with one-of-a-kind problems, but statistical methods are typically intended to be used in routine practice. This is recognized in classical theory (where statistical properties are evaluated based on their long-run frequency distributions) and in Bayesian statistics (averaging over the prior distribution). In computer science, machine learning algorithms are compared using cross-validation on benchmark corpuses, which is another sort of reference distribution. With good data, a classical procedure should be robust and have good statistical properties under a wide range of frequency distributions, Bayesian inferences should be reasonable even if averaging over alternative choices of prior distribution, and the relative performance of machine learning algorithms should not depend strongly on the choice of corpus.

How do we, as statisticians, decide what default methods to use? Here I am using the term "method" broadly, to include general approaches to statistics (e.g., Bayesian, likelihood-based, or nonparametric) as well as more specific choices of models (e.g., linear regression, splines, or Gaussian processes) and options within a model or method (e.g., model averaging, L1 regularization, or hierarchical partial pooling). There are so many choices that it is hard to imagine any statistician carefully weighing the costs and benefits of each before deciding how to solve any given problem.


In addition, given the existence of multiple competing approaches to statistical inference and decision making, we can deduce that no single method dominates the others.

Sometimes the choice of statistical philosophy is decided by convention or convenience. For example, I recently worked as a consultant on a legal case involving audits of several random samples of financial records. I used the classical estimate p̂ = y/n with standard error √{p̂(1 − p̂)/n}, switching to p̂ = (y + 2)/(n + 4) for cases where y = 0 or y = n. This procedure is simple, gives reasonable estimates with good confidence coverage, and can be backed up by a solid reference, namely Agresti and Coull (1998), which has been cited over 1000 times according to Google Scholar. If we had been in a situation with strong prior knowledge on the probabilities p, or interest in distinguishing between p = .99, .999, and .9999, it would have made sense to consider something closer to a full Bayesian approach, but in this setting it was enough to know that the probabilities were high, and so the simple (y + 2)/(n + 4) estimate (and associated standard error) was fine for our data, which included values such as y = n = 75.
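A minimal sketch of this default procedure, not the actual consulting code: the switching rule follows the description above, the use of n + 4 in the standard error is one common reading of the adjusted estimate, and the 95% normal interval and the function name are illustrative choices of my own.

```python
import math

def audit_proportion(y, n, z=1.96):
    """Point estimate and approximate interval for a sampled error rate.

    Uses the classical estimate y/n, switching to the adjusted
    (y + 2)/(n + 4) estimate when y = 0 or y = n, as described above.
    """
    if 0 < y < n:
        p_hat, n_eff = y / n, n
    else:
        p_hat, n_eff = (y + 2) / (n + 4), n + 4  # adjusted estimate
    se = math.sqrt(p_hat * (1 - p_hat) / n_eff)
    return p_hat, se, (max(0.0, p_hat - z * se), min(1.0, p_hat + z * se))

# Example: a sample in which all 75 audited records were correct.
print(audit_proportion(75, 75))   # estimate below 1, with a nonzero standard error
```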


In many settings, however, we have freedom in deciding how to attack a problem statistically. How then do we decide how to proceed?

Schools of statistical thought are sometimes jokingly likened to religions. This analogy is not perfect — unlike religions, statistical methods have no supernatural content and make essentially no demands on our personal lives. Looking at the comparison from the other direction, it is possible to be agnostic, atheistic, or simply live one's life without religion, but it is not really possible to do statistics without some philosophy. Even if you take a Tukeyesque stance and admit only data and data manipulations without reference to probability models, you still need some criteria to evaluate the methods that you choose.

One way in which schools of statistics are like religions is in how we end up affiliating with them. Based on informal observation, I would say that statisticians typically absorb the ambient philosophy of the institution where they are trained — or else, more rarely, they rebel against their training or pick up a philosophy later in their career or from some other source such as a persuasive book. Similarly, people in modern societies are free to choose their religious affiliation but it typically is the same as the religion of parents and extended family. Philosophy, like religion but not (in general) ethnicity, is something we are free to choose on our own, even if we do not usually take the opportunity to make that choice. Rather, it is common to exercise our free will in this setting by forming our own personal accommodation with the religion or philosophy bequeathed to us by our background.

For example, I affiliated as a Bayesian after studying with Don Rubin and, over the decades, have evolved my own philosophy using his as a starting point. I did not go completely willingly into the Bayesian fold — the first statistics course I took (before I came to Harvard) had a classical perspective, and in the first course I took with Don, I continued to try to frame all the inferential problems in a Neyman–Pearson framework. But it didn't take me or my fellow students long to slip into comfortable conformity.

My views of Bayesian statistics have changed over the years — in particular, I have become much more fond of informative priors than I was during the writing of the first two editions of Bayesian Data Analysis (published 1995 and 2004) — and I went through a period of disillusionment in 1991, when I learned to my dismay that most of the Bayesians at the fabled Valencia meeting had no interest in checking the fit of their models. In fact, it was a common view among Bayesians at the time that it was either impossible, inadvisable, or inappropriate to check the fit of a model to data. The idea was that the prior distribution and the data model were subjective and thus uncheckable. To me, this attitude seemed silly — if a model is generated subjectively, that would seem to be more of a reason to check it — and since then my colleagues and I have expressed this argument in a series of papers; see, e.g., Gelman et al. (1996) and Gelman and Shalizi (2012). I am happy to say that the prevailing attitude among Bayesians has changed, with some embracing posterior predictive checks and others criticizing such tests for their low power (see, e.g., Bayarri and Castellanos, 2007). I do not agree with that latter view: I think it confuses different aspects of model checking; see Gelman (2007). On the plus side, however, it represents an acceptance of the idea that Bayesian models can be checked.

But this is all a digression. The point I wanted to make here is that the division of statistics into parallel schools of thought, while unfortunate, has its self-perpetuating aspects. In particular, I can communicate with fellow Bayesians in a way that I sometimes have difficulty with others. For example, some Bayesians dislike posterior predictive checks, but non-Bayesians mostly seem to ignore the idea — even though Xiao-Li Meng, Hal Stern, and I wrote our paper in general terms and originally thought our methods might appeal more strongly to non-Bayesians. After all, those statisticians were already using p-values to check model fit, so it seemed like a small step to average over a distribution. But this was a step that, by and large, only Bayesians wanted to take. The reception of this article was what convinced me to focus on reforming Bayesianism from the inside rather than trying to develop methods one at a time that would make non-Bayesians happy.


26.2 Ways of knowing

How do we decide to believe in the effectiveness of a statistical method? Here are a few potential sources of evidence (I leave the list unnumbered so as not to imply any order of priority):

(a) mathematical theory (e.g., coherence of inference or convergence);

(b) computer simulations (e.g., demonstrating approximate coverage of interval estimates under some range of deviations from an assumed model);

(c) solutions to toy problems (e.g., comparing the partial pooling estimate for the eight schools to the no pooling or complete pooling estimates; see the sketch following this list);

(d) improved performance on benchmark problems (e.g., getting better predictions for the Boston Housing Data);

(e) cross-validation and external validation of predictions;

(f) success as recognized in a field of application (e.g., our estimates of the incumbency advantage in congressional elections);

(g) success in the marketplace (under the theory that if people are willing to pay for something, it is likely to have something to offer).

None of these is enough on its own. Theory and simulations are only as good as their assumptions; results from toy problems and benchmarks don't necessarily generalize to applications of interest; cross-validation and external validation can work for some sorts of predictions but not others; and subject-matter experts and paying customers can be fooled. The very imperfections of each of these sorts of evidence give a clue as to why it makes sense to care about all of them. We can't know for sure, so it makes sense to have many ways of knowing.
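As a concrete version of the toy-problem evidence in item (c), here is a minimal sketch of partial pooling with the between-school scale treated as known. The effect estimates and standard errors are the familiar eight-schools values as reproduced in Bayesian Data Analysis, and the value tau = 10 is purely illustrative.

```python
import numpy as np

# Familiar eight-schools summary data (effect estimates and standard errors).
y     = np.array([28.,  8., -3.,  7., -1.,  1., 18., 12.])
sigma = np.array([15., 10., 16., 11.,  9., 11., 10., 18.])

def partial_pool(y, sigma, tau):
    """Shrinkage estimates given a between-school scale tau.

    tau = 0 gives complete pooling; tau -> infinity gives no pooling.
    """
    if tau == 0:
        w = 1 / sigma**2
        return np.full_like(y, np.sum(w * y) / np.sum(w))
    prec_data, prec_prior = 1 / sigma**2, 1 / tau**2
    mu = np.sum(y / (sigma**2 + tau**2)) / np.sum(1 / (sigma**2 + tau**2))
    return (prec_data * y + prec_prior * mu) / (prec_data + prec_prior)

print("complete pooling:        ", partial_pool(y, sigma, 0).round(1))
print("partial pooling (tau=10):", partial_pool(y, sigma, 10).round(1))
print("no pooling:              ", y)
```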


I do not delude myself that the methods I personally prefer have some absolute status. The leading statisticians of the twentieth century were Neyman, Pearson, and Fisher. None of them used partial pooling or hierarchical models (well, maybe occasionally, but not much), and they did just fine. Meanwhile, other statisticians such as myself use hierarchical models to partially pool as a compromise between complete pooling and no pooling. It is a big world, big enough for Fisher to have success with his methods, Rubin to have success with his, Efron to have success with his, and so forth. A few years ago (Gelman, 2010) I wrote of the methodological attribution problem:

"The many useful contributions of a good statistical consultant, or collaborator, will often be attributed to the statistician's methods or philosophy rather than to the artful efforts of the statistician himself or herself. Don Rubin has told me that scientists are fundamentally Bayesian (even if they do not realize it), in that they interpret uncertainty intervals Bayesianly. Brad Efron has talked vividly about how his scientific collaborators find permutation tests and p-values to be the most convincing form of evidence. Judea Pearl assures me that graphical models describe how people really think about causality. And so on. I am sure that all these accomplished researchers, and many more, are describing their experiences accurately. Rubin wielding a posterior distribution is a powerful thing, as is Efron with a permutation test or Pearl with a graphical model, and I believe that (a) all three can be helping people solve real scientific problems, and (b) it is natural for their collaborators to attribute some of these researchers' creativity to their methods.

The result is that each of us tends to come away from a collaboration or consulting experience with the warm feeling that our methods really work, and that they represent how scientists really think. In stating this, I am not trying to espouse some sort of empty pluralism — the claim that, for example, we would be doing just as well if we were all using fuzzy sets, or correspondence analysis, or some other obscure statistical method. There is certainly a reason that methodological advances are made, and this reason is typically that existing methods have their failings. Nonetheless, I think we all have to be careful about attributing too much from our collaborators' and clients' satisfaction with our methods."

26.3 The pluralist's dilemma

Consider the arguments made fifty years ago or so in favor of Bayesian inference. At that time, there were some applied successes (e.g., I.J. Good repeatedly referred to his successes using Bayesian methods to break codes in the Second World War) but most of the arguments in favor of Bayes were theoretical. To start with, it was (and remains) trivially (but not unimportantly) true that, conditional on the model, Bayesian inference gives the right answer. The whole discussion then shifts to whether the model is true, or, better, how the methods perform under the (essentially certain) condition that the model's assumptions are violated, which leads into the tangle of various theorems about robustness or lack thereof.

Forty or fifty years ago one of Bayesianism's major assets was its mathematical coherence, with various theorems demonstrating that, under the right assumptions, Bayesian inference is optimal. Bayesians also spent a lot of time writing about toy problems, for example, Basu's example of the weights of elephants (Basu, 1971). From the other direction, classical statisticians felt that Bayesians were idealistic and detached from reality.

How things have changed! To me, the key turning points occurred around 1970–80, when statisticians such as Lindley, Novick, Smith, Dempster, and Rubin applied hierarchical Bayesian modeling to solve problems in education research that could not be easily attacked otherwise. Meanwhile Box did similar work in industrial experimentation and Efron and Morris connected these approaches to non-Bayesian theoretical ideas. The key in any case was to use partial pooling to learn about groups for which there was only a small amount of local data.


Lindley, Novick, and the others came at this problem in several ways. First, there was Bayesian theory. They realized that, rather than seeing certain aspects of Bayes (for example, the need to choose priors) as limitations, they could see them as opportunities (priors can be estimated from data!), with the next step folding this approach back into the Bayesian formalism via hierarchical modeling. We (the Bayesian community) are still doing research on these ideas; see, for example, the recent paper by Polson and Scott (2012) on prior distributions for hierarchical scale parameters.

The second way that the Bayesians of the 1970s succeeded was by applying their methods to realistic problems. This is a pattern that has happened with just about every successful statistical method I can think of: an interplay between theory and practice. Theory suggests an approach which is modified in application, or practical decisions suggest a new method which is then studied mathematically, and this process goes back and forth.

To continue with the timeline: the modern success of Bayesian methods is often attributed to our ability, using methods such as the Gibbs sampler and Metropolis algorithm, to fit an essentially unlimited variety of models: practitioners can use programs such as Stan to fit their own models, and researchers can implement new models at the expense of some programming but without the need of continually developing new approximations and new theory for each model. I think that's right — Markov chain simulation methods indeed allow us to get out of the pick-your-model-from-the-cookbook trap — but I think the hierarchical models of the 1970s (which were fit using various approximations, not MCMC) showed the way.

Back 50 years ago, theoretical justifications were almost all that Bayesian statisticians had to offer. But now that we have decades of applied successes, that is naturally what we point to. From the perspective of Bayesians such as myself, theory is valuable (our Bayesian data analysis book is full of mathematical derivations, each of which can be viewed if you'd like as a theoretical guarantee that various procedures give correct inferences conditional on assumed models) but applications are particularly convincing. And applications can ultimately become good toy problems, once they have been smoothed down from years of teaching.

Over the years I have become pluralistic in my attitudes toward statistical methods. Partly this comes from my understanding of the history described above. Bayesian inference seemed like a theoretical toy and was considered by many leading statisticians as somewhere between a joke and a menace — see Gelman and Robert (2013) — but the hardcore Bayesians such as Lindley, Good, and Box persisted and got some useful methods out of it. To take a more recent example, the bootstrap idea of Efron (1979) is an idea that in some way is obviously wrong (as it assigns zero probability to data that did not occur, which would seem to violate the most basic ideas of statistical sampling) yet has become useful to many and has since been supported in many cases by theory.
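For readers who have not seen it, a minimal sketch of that bootstrap idea: resample the observed data with replacement and recompute the statistic, so each replicate indeed puts probability only on values that actually occurred. The sample, the statistic (the median), and the number of replications are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(scale=2.0, size=50)   # illustrative skewed sample

def bootstrap_se(data, stat, n_boot=2000, rng=rng):
    """Standard error of a statistic by resampling the data with replacement."""
    reps = [stat(rng.choice(data, size=len(data), replace=True))
            for _ in range(n_boot)]
    return np.std(reps, ddof=1)

print("sample median:", np.median(x))
print("bootstrap standard error:", bootstrap_se(x, np.median))
```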


In this discussion, I have the familiar problem that might be called the pluralist's dilemma: how to recognize that my philosophy is just one among many, that my own embrace of this philosophy is contingent on many things beyond my control, while still expressing the reasons why I believe this philosophy to be preferable to the alternatives (at least for the problems I work on).

One way out of the dilemma is to recognize that different methods are appropriate for different problems. It has been said that R.A. Fisher's methods and the associated 0.05 threshold for p-values worked particularly well for experimental studies of large effects with relatively small samples — the sorts of problems that appear over and over again in the books of Fisher, Snedecor, Cochran, and their contemporaries. That approach might not work so well in settings with observational data and sample sizes that vary over several orders of magnitude. I will again quote myself (Gelman, 2010):

"For another example of how different areas of application merit different sorts of statistical thinking, consider Rob Kass's remark: 'I tell my students in neurobiology that in claiming statistical significance I get nervous unless the p-value is much smaller than .01.' In political science, we are typically not aiming for that level of uncertainty. (Just to get a sense of the scale of things, there have been barely 100 national elections in all of US history, and political scientists studying the modern era typically start in 1946.)"

Another answer is path dependence. Once you develop facility with a statistical method, you become better at it. At least in the short term, I will be a better statistician using methods with which I am already familiar. Occasionally I will learn a new trick, but only if forced to by circumstances. The same pattern can hold true with research: we are more equipped to make progress in a field along directions in which we are experienced and knowledgeable. Thus, Bayesian methods can be the most effective for me and my students, for the simple reason that we have already learned them.

26.4 Conclusions

Statistics is a young science in which progress is being made in many areas. Some methods in common use are many decades or even centuries old, but recent and current developments in nonparametric modeling, regularization, and multivariate analysis are central to state-of-the-art practice in many areas of applied statistics, ranging from psychometrics to genetics to predictive modeling in business and social science. Practitioners have a wide variety of statistical approaches to choose from, and researchers have many potential directions to study.


A casual and introspective review suggests that there are many different criteria we use to decide that a statistical method is worthy of routine use. Those of us who lean on particular ways of knowing (which might include performance on benchmark problems, success in new applications, insight into toy problems, optimality as shown by simulation studies or mathematical proofs, or success in the marketplace) should remain aware of the relevance of all these dimensions in the spread of default procedures.

References

Agresti, A. and Coull, B.A. (1998). Approximate is better than exact for interval estimation of binomial proportions. The American Statistician, 52:119–126.

Basu, D. (1971). An essay on the logical foundations of survey sampling, part 1 (with discussion). In Foundations of Statistical Inference (V.P. Godambe and D.A. Sprott, Eds.). Holt, Rinehart and Winston, Toronto, pp. 203–242.

Box, G.E.P. and Tiao, G.C. (1973). Bayesian Inference in Statistical Analysis. Wiley, New York.

Dempster, A.P., Rubin, D.B., and Tsutakawa, R.K. (1981). Estimation in covariance components models. Journal of the American Statistical Association, 76:341–353.

Efron, B. (1979). Bootstrap methods: Another look at the jackknife. The Annals of Statistics, 7:1–26.

Efron, B. and Morris, C. (1975). Data analysis using Stein's estimator and its generalizations. Journal of the American Statistical Association, 70:311–319.

Gelman, A. (2010). Bayesian statistics then and now. Discussion of "The future of indirect evidence," by Bradley Efron. Statistical Science, 25:162–165.

Gelman, A., Meng, X.-L., and Stern, H.S. (1996). Posterior predictive assessment of model fitness via realized discrepancies (with discussion). Statistica Sinica, 6:733–807.

Gelman, A. and Robert, C.P. (2013). "Not only defended but also applied": The perceived absurdity of Bayesian inference (with discussion). The American Statistician, 67(1):1–5.


Gelman, A. and Shalizi, C. (2012). Philosophy and the practice of Bayesian statistics (with discussion). British Journal of Mathematical and Statistical Psychology, 66:8–38.

Lindley, D.V. and Novick, M.R. (1981). The role of exchangeability in inference. The Annals of Statistics, 9:45–58.

Lindley, D.V. and Smith, A.F.M. (1972). Bayes estimates for the linear model. Journal of the Royal Statistical Society, Series B, 34:1–41.

Polson, N.G. and Scott, J.G. (2012). On the half-Cauchy prior for a global scale parameter. Bayesian Analysis, 7(2):1–16.

Tukey, J.W. (1977). Exploratory Data Analysis. Addison-Wesley, Reading, MA.


27
Serial correlation and Durbin–Watson bounds

T.W. Anderson
Department of Economics and Department of Statistics
Stanford University, Stanford, CA

Consider the model y = Xβ + u, where y is an n-vector of dependent variables, X is an n × k matrix of independent variables, and u is an n-vector of unobserved disturbances. Let z = y − Xb, where b is the least squares estimate of β. The d statistic tests the hypothesis that the components of u are independent versus the alternative that the components follow a Markov process. The Durbin–Watson bounds pertain to the distribution of the d statistic.

27.1 Introduction

A time series is composed of a sequence of observations y_1, ..., y_n, where the index i of the observation y_i represents time. An important feature of a time series is the order of observations: y_i is observed after y_1, ..., y_{i−1} are observed. The correlation of successive observations is called a serial correlation. Related to each y_i may be a vector of independent variables (x_{1i}, ..., x_{ki}). Many questions of time series analysis relate to the possible dependence of y_i on x_{1i}, ..., x_{ki}; see, e.g., Anderson (1971).

A serial correlation (first-order) of a sequence y_1, ..., y_n is
\[
\sum_{i=2}^{n} y_i y_{i-1} \Big/ \sum_{i=1}^{n} y_i^2 .
\]
This coefficient measures the correlation between y_1, ..., y_{n−1} and y_2, ..., y_n. There are various modifications of this correlation coefficient, such as replacing y_i by y_i − ȳ; see below for the circular serial coefficient. The term "autocorrelation" is also used for serial correlation.


I shall discuss two papers coauthored by James Durbin and Geoffrey Watson entitled "Testing for serial correlation in least squares regression, I and II," published in 1950 and 1951, respectively (Durbin and Watson, 1950, 1951). The statistical analysis developed in these papers has proved very useful in econometric research.

The Durbin–Watson papers are based on a model in which there is a set of "independent" variables x_{1i}, ..., x_{ki} associated with each "dependent" variable y_i for i ∈ {1, ..., n}. The dependent variable y_i is considered as the linear combination
\[
y_i = \beta_1 x_{1i} + \cdots + \beta_k x_{ki} + w_i, \qquad i \in \{1, \dots, n\},
\]
where w_i is an unobservable random disturbance. The questions that Durbin and Watson address have to do with the possible dependence in a set of observations y_1, ..., y_n beyond what is explained by the independent variables.

27.2 Circular serial correlation

R.L. Anderson (Anderson, 1942), who was Watson's thesis advisor, studied the statistic
\[
\frac{\sum_{i=1}^{n} (y_i - y_{i-1})^2}{\sum_{i=1}^{n} y_i^2}
= 2 - 2\,\frac{\sum_{i=1}^{n} y_i y_{i-1}}{\sum_{i=1}^{n} y_i^2},
\]
where y_0 = y_n. The statistic
\[
\sum_{i=1}^{n} y_i y_{i-1} \Big/ \sum_{i=1}^{n} y_i^2
\]
is known as the "circular serial correlation coefficient." Defining y_0 = y_n is a device to make the mathematics simpler. The serial correlation coefficient measures the relationship between the sequence y_1, ..., y_n and y_0, ..., y_{n−1}.

In our exposition we make repeated use of the fact that the distribution of x^⊤Ax is the distribution of ∑_{i=1}^{n} λ_i z_i², where λ_1, ..., λ_n are the characteristic roots (latent roots) of A, i.e., the roots of |A − λI_n| = 0 with A = A^⊤, and x and z have the density N(0, σ²I). The numerator of the circular serial correlation is x^⊤Ax, where
\[
A = \frac{1}{2}
\begin{pmatrix}
0 & 1 & 0 & \cdots & 1\\
1 & 0 & 1 & \cdots & 0\\
0 & 1 & 0 & \cdots & 0\\
\vdots & & & \ddots & \vdots\\
1 & 0 & 0 & \cdots & 0
\end{pmatrix}.
\]
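A small numerical check of this representation, using an arbitrary simulated series: the quadratic form x^⊤Ax reproduces the circular sum ∑ y_i y_{i−1} (with y_0 = y_n), and the characteristic roots of A are the cosines stated next.

```python
import numpy as np

n = 8
rng = np.random.default_rng(0)
y = rng.standard_normal(n)                      # arbitrary illustrative series

# Circular serial numerator: sum_i y_i y_{i-1}, with y_0 = y_n.
num_direct = np.sum(y * np.roll(y, 1))

# The same numerator as a quadratic form y'Ay with the circulant matrix A above.
A = 0.5 * (np.eye(n, k=1) + np.eye(n, k=-1))
A[0, -1] = A[-1, 0] = 0.5
num_quadratic = y @ A @ y
print(np.isclose(num_direct, num_quadratic))    # True

# Characteristic roots of A are cos(2*pi*j/n), j = 1, ..., n.
roots = np.sort(np.linalg.eigvalsh(A))
print(np.allclose(roots, np.sort(np.cos(2 * np.pi * np.arange(1, n + 1) / n))))  # True
```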


The characteristic roots are λ_j = cos 2πj/n, j ∈ {1, ..., n}, with characteristic vectors built from cos 2πji/n and sin 2πji/n. If n is even, the roots occur in pairs. The distribution of the circular serial correlation is the distribution of
\[
\sum_{j=1}^{n} \lambda_j z_j^2 \Big/ \sum_{j=1}^{n} z_j^2, \tag{27.1}
\]
where z_1, ..., z_n are independent standard Normal variables. Anderson studied the distribution of the circular serial correlation, its moments, and other properties.

27.3 Periodic trends

During World War II, R.L. Anderson and I were members of the Princeton Statistical Research Group. We noticed that the jth characteristic vector of A had the form cos 2πjh/n and/or sin 2πjh/n, h ∈ {1, ..., n}. These functions are periodic and hence are suitable to represent seasonal variation. We considered the model
\[
y_i = \beta_1 x_{1i} + \cdots + \beta_k x_{ki} + u_i,
\]
where x_{hi} = cos 2πhi/n and/or sin 2πhi/n. Then the distribution of
\[
r = \frac{\sum_i \big(y_i - \sum_h \beta_h x_{hi}\big)\big(y_{i-1} - \sum_h \beta_h x_{h,i-1}\big)}{\sum_i \big(y_i - \sum_h \beta_h x_{hi}\big)^2}
\]
is the distribution of (27.1), where the sums are over the z's corresponding to the cosine and sine terms that did not occur in the trends. The distributions of the serial correlations have the same form as before. Anderson and Anderson (1950) found distributions of r for several cyclical trends as well as moments and approximate distributions.

27.4 Uniformly most powerful tests

As described in Anderson (1948), many problems of serial correlation are included in the general model
\[
K \exp\Big[-\frac{\alpha}{2}\,\big\{(y - \mu)^\top \Psi\,(y - \mu) + \lambda\,(y - \mu)^\top \Theta\,(y - \mu)\big\}\Big],
\]
where K is a constant, α > 0, Ψ is a given positive definite matrix, Θ is a given symmetric matrix, λ is a parameter such that Ψ − λΘ is positive definite, and µ is the expectation of y,
\[
E\,y = \mu = \sum_j \beta_j \phi_j .
\]


We shall consider testing the hypothesis
\[
H : \lambda = 0 .
\]
The first theorem characterizes tests such that the probability of the acceptance region when λ = 0 does not depend on the values of β_1, ..., β_k. The second theorem gives conditions for a test to be uniformly most powerful when λ > 0 is the alternative. These theorems are applicable to the circular serial correlation when Ψ = σ²I and Θ = σ²A as defined above.

The equation
\[
\sum (y_i - y_{i-1})^2 = \sum \big(y_i^2 + y_{i-1}^2\big) - 2 \sum y_i y_{i-1}
\]
suggests that a serial correlation can be studied in terms of ∑ (y_i − y_{i−1})², which may be suitable to test that y_1, ..., y_n are independent against the alternative that y_1, ..., y_n satisfy an autoregressive process. Durbin and Watson prefer to study
\[
d = \sum (z_i - z_{i-1})^2 \Big/ \sum z_i^2 ,
\]
where z is defined below.

27.5 Durbin–Watson

The model is
\[
y_{n\times 1} = X_{n\times k}\,\beta_{k\times 1} + u_{n\times 1}.
\]
We consider testing the null hypothesis that u has a Normal distribution with mean 0 and covariance σ²I_n against the alternative that u has a Normal distribution with mean 0 and covariance σ²A, a positive definite matrix. The sample regression coefficient is b = (X^⊤X)^{−1}X^⊤y and the vector of residuals is
\[
z = y - Xb = \{I - X(X^\top X)^{-1}X^\top\}\,y = \{I - X(X^\top X)^{-1}X^\top\}(X\beta + u) = Mu,
\]
where M = I − X(X^⊤X)^{−1}X^⊤.


Consider the serial correlation of the residuals,
\[
r = \frac{z^\top A z}{z^\top z} = \frac{u^\top M^\top A M u}{u^\top M^\top M u}.
\]
The matrix M is idempotent, i.e., M² = M, and symmetric. Its latent roots are 0 and 1 and it has rank n − k. Let the possibly nonzero roots of M^⊤AM be ν_1, ..., ν_{n−k}. There is an n × (n − k) matrix H such that H^⊤H = I_{n−k} and
\[
H^\top M^\top A M H = \operatorname{diag}(\nu_1, \nu_2, \dots, \nu_{n-k}).
\]
Let w = H^⊤u. Then
\[
r = \sum_{j=1}^{n-k} \nu_j w_j^2 \Big/ \sum_{j=1}^{n-k} w_j^2 .
\]
Durbin and Watson prove that
\[
\lambda_j \le \nu_j \le \lambda_{j+k}, \qquad j \in \{1, \dots, n-k\}.
\]
Define
\[
r_L = \sum_{j=1}^{n-k} \lambda_j w_j^2 \Big/ \sum_{j=1}^{n-k} w_j^2 , \qquad
r_U = \sum_{j=1}^{n-k} \lambda_{j+k} w_j^2 \Big/ \sum_{j=1}^{n-k} w_j^2 .
\]
Then r_L ≤ r ≤ r_U.

The "bounds procedure" is the following. If the observed serial correlation is greater than r*_U, conclude that the hypothesis of no serial correlation of the disturbances is rejected. If the observed correlation is less than r*_L, conclude that the hypothesis of no serial correlation of the disturbances is accepted. The interval (r*_L, r*_U) is called "the zone of indeterminacy." If the observed correlation falls in the interval (r*_L, r*_U), the data are considered as not leading to a conclusion.
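A minimal sketch of the d statistic in this setting (the regressors, sample size, and AR(1) disturbance are arbitrary illustrative choices); in practice the computed value is compared with the tabulated Durbin–Watson bounds for the given n and k, which are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 50, 3
X = np.column_stack([np.ones(n), rng.standard_normal((n, k - 1))])  # intercept + 2 regressors
beta = np.array([1.0, 2.0, -1.0])

# Illustrative disturbances following a Markov (AR(1)) process.
u = np.zeros(n)
for i in range(1, n):
    u[i] = 0.6 * u[i - 1] + rng.standard_normal()
y = X @ beta + u

b = np.linalg.lstsq(X, y, rcond=None)[0]   # least squares estimate of beta
z = y - X @ b                              # residuals z = y - Xb

d = np.sum(np.diff(z) ** 2) / np.sum(z ** 2)
print("Durbin-Watson d:", round(d, 3))
# Positive serial correlation pushes d below 2; the value is then compared with
# the tabulated lower and upper bounds for this n and k (bounds procedure above).
```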


References

Anderson, R.L. (1942). Distribution of the serial correlation coefficient. The Annals of Mathematical Statistics, 13:1–13.

Anderson, R.L. and Anderson, T.W. (1950). Distribution of the circular serial correlation coefficient for residuals from a fitted Fourier series. The Annals of Mathematical Statistics, 21:59–81.

Anderson, T.W. (1948). On the theory of testing serial correlation. Skandinavisk Aktuarietidskrift, 31:88–116.

Anderson, T.W. (1971). The Statistical Analysis of Time Series. Wiley, New York.

Anderson, T.W. (2003). An Introduction to Multivariate Statistical Analysis, 3rd edition. Wiley, New York.

Chipman, J.S. (2011). Advanced Econometric Theory. Routledge, London.

Durbin, J. and Watson, G.S. (1950). Testing for serial correlation in least squares regression. I. Biometrika, 37:409–428.

Durbin, J. and Watson, G.S. (1951). Testing for serial correlation in least squares regression. II. Biometrika, 38:159–178.


28
A non-asymptotic walk in probability and statistics

Pascal Massart
Département de mathématiques
Université de Paris-Sud, Orsay, France

My research is devoted to the derivation of non-asymptotic results in probability and statistics. Basically, this is a question of personal taste: I have been struggling with constants in probability bounds since the very beginning of my career. I was very lucky to learn from the elegant work of Michel Talagrand that the dream of a non-asymptotic theory of independence could actually become reality. Thanks to my long-term collaboration with my colleague and friend Lucien Birgé, I came to realize the importance of a non-asymptotic approach to statistics. This led me to follow a singular path, back and forth between concentration inequalities and model selection, that I briefly describe below in this (informal) paper for the 50th anniversary of the COPSS.

28.1 Introduction

The interest in non-asymptotic tail bounds for functions of independent random variables is rather recent in probability theory. Apart from sums, which have been well understood for a long time, powerful tools for handling more general functions of independent random variables were not introduced before the 1970s. The systematic study of concentration inequalities aims at bounding the probability that such a function differs from its expectation or its median by more than a given amount. It emerged from a remarkable series of papers by Michel Talagrand in the mid-1990s.

Talagrand provided a major new insight into the problem, around the idea summarized in Talagrand (1995): "A random variable that smoothly depends on the influence of many independent random variables satisfies Chernoff type bounds." This revolutionary approach opened new directions of research and stimulated numerous applications in various fields such as discrete mathematics, statistical mechanics, random matrix theory, high-dimensional geometry, and statistics.


The study of random fluctuations of suprema of empirical processes has been crucial in the application of concentration inequalities to statistics and machine learning. It also turned out to be a driving force behind the development of the theory. This is exactly what I would like to illustrate here, while focusing on the impact on my own research in the 1990s and beyond.

28.2 Model selection

Model selection is a classical topic in statistics. The idea of selecting a model by penalizing some empirical criterion goes back to the early 1970s with the pioneering work of Mallows and Akaike. The classical parametric view on model selection as exposed in Akaike's seminal paper (Akaike, 1973) on penalized log-likelihood is asymptotic in essence. More precisely, Akaike's formula for the penalty depends on Wilks' theorem, i.e., on an asymptotic expansion of the log-likelihood.

Lucien Birgé and I started to work on model selection criteria based on a non-asymptotic penalized log-likelihood early in the 1990s. We had in mind that in the usual asymptotic approach to model selection, it is often unrealistic to assume that the number of observations tends to infinity while the list of models and their size are fixed. Either the number of observations is not that large (a hundred, say) and when playing with models with a moderate number of parameters (five or six) you cannot be sure that asymptotic results apply, or the number of observations is really large (as in signal de-noising, for instance) and you would like to take advantage of it by considering a potentially large list of models involving possibly large numbers of parameters.

From a non-asymptotic perspective, the number of observations and the list of models are what they are. The purpose of an ideal model selection procedure is to provide a data-driven choice of model that tends to optimize some criterion, e.g., minimum expected risk with respect to the quadratic loss or the Kullback–Leibler loss. This provides a well-defined mathematical formalization of the model selection problem, but it leaves open the search for a neat generic solution.

Fortunately for me, the early 1990s turned out to be a rich period for the development of mathematical statistics, and I came across the idea that letting the size of models go to infinity with the number of observations makes it possible to build adaptive nonparametric estimators. This idea can be traced back to at least two different sources: information theory and signal analysis. In particular, Lucien and I were very impressed by the beautiful paper of Andrew Barron and Tom Cover (Barron and Cover, 1991) on density estimation via minimum complexity model selection. The main message there (at least for discrete models) is that if you allow model complexity to grow with sample size, you can then use minimum complexity penalization to build nonparametric estimators of a density which adapt to its smoothness.


Meanwhile, David Donoho, Iain Johnstone, Gérard Kerkyacharian and Dominique Picard were developing their approach to wavelet estimation. Their striking work showed that in a variety of problems, it is possible to build adaptive estimators of a regression function or a density through a remarkably simple procedure: thresholding of the empirical wavelet coefficients. Many papers could be cited here, but Donoho et al. (1995) is possibly the most useful review on the topic. Wavelet thresholding has an obvious model selection flavor to it, as it amounts to selecting a set of wavelet coefficients from the data.

At some point, it became clear to us that there was room for building a general theory to help reconcile Akaike's classical approach to model selection, the emerging results by Barron and Cover or Donoho et al. in which model selection is used to construct nonparametric adaptive estimators, and Vapnik's structural risk minimization approach to statistical learning; see Vapnik (1982).

28.2.1 The model choice paradigm

Assume that a random variable ξ^(n) is observed which depends on a parameter n. For concreteness, you may think of ξ^(n) as an n-sample from some unknown distribution. Consider the problem of estimating some quantity of interest, s, which is known to belong to some (large) set 𝒮. Consider an empirical risk criterion γ_n based on ξ^(n) such that the mapping
\[
t \mapsto E\{\gamma_n(t)\}
\]
achieves a minimum at the point s. One can then define a natural (nonnegative) loss function related to this criterion by setting, for all t ∈ 𝒮,
\[
\ell(s, t) = E\{\gamma_n(t)\} - E\{\gamma_n(s)\}.
\]
When ξ^(n) = (ξ_1, ..., ξ_n), the empirical risk criterion γ_n is usually defined as some empirical mean
\[
\gamma_n(t) = P_n\{\gamma(t, \cdot)\} = \frac{1}{n}\sum_{i=1}^{n} \gamma(t, \xi_i) \tag{28.1}
\]
of an adequate risk function γ. Two typical examples are as follows.

Example 1 (Density estimation). Let ξ_1, ..., ξ_n be a random sample from an unknown density s with respect to a given measure µ. Taking γ(t, x) = −ln{t(x)} in (28.1) leads to the log-likelihood criterion. The corresponding loss function ℓ is simply the Kullback–Leibler information between the probability measures sµ and tµ. Indeed, ℓ(s, t) = ∫ s ln(s/t) dµ if sµ is absolutely continuous with respect to tµ and ℓ(s, t) = ∞ otherwise. However, if γ(t, x) = ‖t‖² − 2t(x) in (28.1), where ‖·‖ denotes the norm in L²(µ), one gets the least squares criterion and the loss function is given by ℓ(s, t) = ‖s − t‖² for every t ∈ L²(µ).


Example 2 (Gaussian white noise). Consider the process ξ^(n) on [0, 1] defined by dξ^(n)(x) = s(x) dx + n^{−1/2} dW(x) with ξ^(n)(0) = 0, where W denotes the Brownian motion. The least squares criterion is defined by γ_n(t) = ‖t‖² − 2∫₀¹ t(x) dξ^(n)(x), and the corresponding loss function ℓ is simply the squared L² distance, defined for all s, t ∈ L²[0, 1] by ℓ(s, t) = ‖s − t‖².

Given a model S (a subset of 𝒮), the empirical risk minimizer is simply defined as a minimizer of γ_n over S. It is a natural estimator of s whose quality is directly linked to that of the model S. The question is then: how can one choose a suitable model S? It would be tempting to choose S as large as possible. Taking S as 𝒮 itself or as a "big" subset of 𝒮 is known to lead either to inconsistent estimators (Bahadur, 1958) or to suboptimal estimators (Birgé and Massart, 1993). In contrast, if S is a "small" model (e.g., some parametric model involving one or two parameters), the behavior of the empirical risk minimizer on S is satisfactory as long as s is close enough to S, but the model can easily end up being completely wrong.

One of the ideas suggested by Akaike is to use the risk associated with the loss function ℓ as a quality criterion for a model. To illustrate this idea, it is convenient to consider a simple example for which everything is easily computable. Consider the white noise framework. If S is a linear space with dimension D, and if φ_1, ..., φ_D denotes some orthonormal basis of S, the least squares estimator is merely a projection estimator, viz.
\[
\hat{s} = \sum_{j=1}^{D} \Big\{\int_0^1 \phi_j(x)\, d\xi^{(n)}(x)\Big\}\,\phi_j ,
\]
and the expected quadratic risk of ŝ is equal to
\[
E\big(\|s - \hat{s}\|^2\big) = d^2(s, S) + D/n.
\]
This formula for the quadratic risk reflects perfectly the model choice paradigm: if the model is to be chosen in such a way that the risk of the resulting least squares estimator remains under control, a balance must be struck between the bias term d²(s, S) and the variance term D/n.

More generally, given an empirical risk criterion γ_n, each model S_m in an (at most countable and usually finite) collection {S_m : m ∈ ℳ} can be represented by the corresponding empirical risk minimizer ŝ_m. One can use the minimum of E{ℓ(s, ŝ_m)} over ℳ as a benchmark for model selection. Ideally, one would like to choose m(s) so as to minimize the risk E{ℓ(s, ŝ_m)} with respect to m ∈ ℳ. This is what Donoho and Johnstone called an oracle; see, e.g., Donoho and Johnstone (1994). The purpose of model selection is to design a data-driven choice m̂ which mimics an oracle, in the sense that the risk of the selected estimator ŝ_m̂ is not too far from the benchmark inf_{m∈ℳ} E{ℓ(s, ŝ_m)}.
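A small numerical illustration of this bias-variance balance, under assumptions of my own choosing: a sequence-space version of the white noise model with illustrative coefficients β_j = j^{−2}, where the oracle dimension is the D minimizing d²(s, S_D) + D/n over the nested models S_D.

```python
import numpy as np

n = 200
J = 500                                    # truncation level for the illustration
beta = 1.0 / np.arange(1, J + 1) ** 2      # illustrative "smooth" signal coefficients

# Risk of projecting onto the first D coordinates: bias^2 + variance
#   = sum_{j > D} beta_j^2 + D/n.
D_grid = np.arange(1, J + 1)
bias2 = np.array([np.sum(beta[D:] ** 2) for D in D_grid])
risk = bias2 + D_grid / n

oracle_D = D_grid[np.argmin(risk)]
print("oracle dimension:", oracle_D, " oracle risk:", risk.min().round(4))

# A Monte Carlo check: observed coefficients beta_j + eps_j / sqrt(n).
rng = np.random.default_rng(0)
y_hat = beta + rng.standard_normal(J) / np.sqrt(n)
loss = np.array([np.sum((np.where(np.arange(J) < D, y_hat, 0.0) - beta) ** 2)
                 for D in D_grid])
print("loss at the oracle D for this draw:", loss[oracle_D - 1].round(4))
```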


28.2.2 Non-asymptopia

The penalized empirical risk model selection procedure consists in considering an appropriate penalty function pen : ℳ → ℝ₊ and choosing m̂ to minimize
\[
\mathrm{crit}(m) = \gamma_n(\hat{s}_m) + \mathrm{pen}(m)
\]
over ℳ. One can then define the selected model S_m̂ and the penalized empirical risk estimator ŝ_m̂.

Akaike's penalized log-likelihood criterion corresponds to the case where the penalty is taken as D_m/n, where D_m denotes the number of parameters defining the regular parametric model S_m. As mentioned above, Akaike's heuristics relies heavily on the assumption that the dimension and the number of models are bounded with respect to n, as n → ∞. Various penalized criteria have been designed according to this asymptotic philosophy; see, e.g., Daniel and Wood (1971).

In contrast, a non-asymptotic approach to model selection allows both the number of models and the number of their parameters to depend on n. One can then choose a list of models which is suitable for approximation purposes, e.g., wavelet expansions, trigonometric or piecewise polynomials, or artificial neural networks. For example, the hard thresholding procedure turns out to be a penalized empirical risk procedure if the list of models depends on n.

To be specific, consider once again the white noise framework and an orthonormal system φ_1, ..., φ_n of L²[0, 1] that depends on n. For every subset m of {1, ..., n}, define the model S_m as the linear span of {φ_j : j ∈ m}. The complete variable selection problem requires the selection of a subset m from the collection of all subsets of {1, ..., n}. Taking a penalty function of the form pen(m) = T²|m|/n leads to an explicit solution for the minimization of crit(m) because in this case, setting β̂_j = ∫ φ_j(x) dξ^(n)(x), the penalized empirical criterion can be written as
\[
\mathrm{crit}(m) = -\sum_{j\in m} \hat{\beta}_j^2 + \frac{T^2 |m|}{n}
= \sum_{j\in m}\Big(-\hat{\beta}_j^2 + \frac{T^2}{n}\Big).
\]
This criterion is obviously minimized at
\[
\hat{m} = \big\{\, j \in \{1, \dots, n\} : \sqrt{n}\,|\hat{\beta}_j| \ge T \,\big\},
\]
which is precisely the hard thresholding procedure. Of course the crucial issue is to choose the level of thresholding, T.

More generally, the question is: what kind of penalty should be recommended from a non-asymptotic perspective? The naive notion that Akaike's criterion could be used in this context fails, in the sense that it may typically lead to under-penalization. In the preceding example, it would lead to the choice T = √2, while it stems from the work of Donoho et al. that the level of thresholding should be at least of order √(2 ln n) as n → ∞.
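A minimal sketch of the hard thresholding rule in the sequence form used above (the signal, noise level, and random seed are illustrative choices), contrasting the naive level T = √2 with T = √(2 ln n).

```python
import numpy as np

n = 1024
rng = np.random.default_rng(7)
beta = np.zeros(n)
beta[:20] = rng.uniform(0.5, 2.0, size=20)      # illustrative sparse signal
beta_hat = beta + rng.standard_normal(n) / np.sqrt(n)

def hard_threshold(beta_hat, T, n):
    """Select m_hat = {j : sqrt(n) |beta_hat_j| >= T}, minimizing crit(m) above."""
    keep = np.sqrt(n) * np.abs(beta_hat) >= T
    return np.where(keep, beta_hat, 0.0), int(keep.sum())

for T in (np.sqrt(2.0), np.sqrt(2.0 * np.log(n))):
    est, size = hard_threshold(beta_hat, T, n)
    print(f"T = {T:.2f}: selected {size} coefficients, "
          f"squared loss = {np.sum((est - beta) ** 2):.4f}")
```

With T = √2 many pure-noise coordinates are kept, which is the under-penalization described above; the larger level retains essentially only the signal coordinates.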


28.2.3 Empirical processes to the rescue

The reason why empirical processes have something to do with the analysis of penalized model selection procedures is roughly the following; see Massart (2007) for further details. Consider the centered empirical risk process
\[
\bar{\gamma}_n(t) = \gamma_n(t) - E\{\gamma_n(t)\}.
\]
Minimizing crit(m) is then equivalent to minimizing
\[
\mathrm{crit}(m) - \gamma_n(s) = \ell(s, \hat{s}_m) - \{\bar{\gamma}_n(s) - \bar{\gamma}_n(\hat{s}_m)\} + \mathrm{pen}(m).
\]
One can readily see from this formula that to mimic an oracle, the penalty pen(m) should ideally be of the same order of magnitude as γ̄_n(s) − γ̄_n(ŝ_m). Guessing the exact order of magnitude of γ̄_n(s) − γ̄_n(ŝ_m) is not an easy task in general, but one can at least try to compare the fluctuations of γ̄_n(s) − γ̄_n(ŝ_m) to the quantity of interest, ℓ(s, ŝ_m). To do so, one can introduce the supremum of the weighted process
\[
Z_m = \sup_{t\in S_m} \frac{\bar{\gamma}_n(s) - \bar{\gamma}_n(t)}{w\{\ell(s, t)\}},
\]
where w is a conveniently chosen non-decreasing weight function. For instance, if w{ℓ(s, t)} = 2√ℓ(s, t), then for every θ > 0,
\[
\ell(s, \hat{s}_m) - \{\bar{\gamma}_n(s) - \bar{\gamma}_n(\hat{s}_m)\} \ge (1 - \theta)\,\ell(s, \hat{s}_m) - Z_m^2/\theta .
\]
Thus, by choosing pen(m) in such a way that Z_m² ≤ θ pen(m) (with high probability), one can hope to compare the model selection procedure with the oracle.

We are at the very point where the theory of empirical processes comes in, because the problem is now to control the quantity Z_m, which is indeed the supremum of an empirical process — at least when the empirical risk is defined through (28.1). Lucien Birgé and I first used this idea in 1994, while preparing our contribution to the Festschrift for Lucien Le Cam to mark his 70th birthday. The corresponding paper, Birgé and Massart (1997), was published later and we generalized it in Barron et al. (1999).

In the context of least squares density estimation that we were investigating, the weight function w to be considered is precisely of the form w(x) = 2√x. Thus if the model S_m happens to be a finite-dimensional subspace of L²(µ) generated by some orthonormal basis {φ_λ : λ ∈ Λ_m}, the quantity of interest
\[
Z_m = \sup_{t\in S_m} \frac{\bar{\gamma}_n(s) - \bar{\gamma}_n(t)}{2\,\|s - t\|} \tag{28.2}
\]
can easily be made explicit. Indeed, assuming that s belongs to S_m (this assumption is not really needed but makes the analysis much more illuminating),
\[
Z_m = \sqrt{\sum_{\lambda\in\Lambda_m} (P_n - P)^2(\phi_\lambda)}, \tag{28.3}
\]
where P = sµ.


In other words, Z_m is simply the square root of a χ²-type statistic. Deriving exponential bounds for Z_m from (28.3) does not look especially easier than starting from (28.2). However, it is clear from (28.3) that E(Z_m²) can be computed explicitly. As a result, one can easily bound E(Z_m) using Jensen's inequality, viz.
\[
E(Z_m) \le \sqrt{E(Z_m^2)} .
\]
By shifting to concentration inequalities, we thus hoped to escape the heavy-duty "chaining" machinery, which was the main tool available at that time to control suprema of empirical processes.

It is important to understand that this is not merely a question of taste or elegance. The disadvantage with chaining inequalities is that even if you optimize them, at the end of the day the best you can hope for is to derive a bound with the right order of magnitude; the associated numerical constants are typically ridiculously large. When the goal is to validate (or invalidate) penalized criteria such as Mallows' or Akaike's criterion from a non-asymptotic perspective, constants do matter. This motivated my investigation of the fascinating topic of concentration inequalities.

28.3 Welcome to Talagrand's wonderland

Motivated by the need to understand whether one can derive concentration inequalities for suprema of empirical processes, I intensified my readings on the concentration of measure. By suprema of empirical processes, I mean
\[
Z = \sup_{t\in T} \sum_{i=1}^{n} X_{i,t},
\]
where T is a set and X_1, ..., X_n are mutually independent random vectors taking values in ℝ^T, with X_{i,t} denoting the coordinate of X_i at index t.

For applications such as the one described above, it is important to cover the case where T is infinite, but for the purpose of establishing structural inequalities like concentration inequalities, the finite case is in fact the only one that matters, because one can recover the general case from the finite case by applying monotone limit procedures, i.e., letting the size of the index set grow to infinity. Henceforth I will thus assume the set T to be finite.

When I started investigating the issue in 1994, the literature was dominated by the Gaussian Concentration Theorem for Lipschitz functions of independent standard Gaussian random variables. This result was proved independently by Borell (1975) and by Cirel'son and Sudakov (1974). As a side remark, note that these authors actually established the concentration of Lipschitz functions around the median; the analogous result for the mean is due to Cirel'son et al. (1976).


At any rate, it seemed to me to be somewhat of a beautiful but isolated mountain, given the abundance of results by Michel Talagrand on the concentration of product measures. In the context of empirical processes, the Gaussian Concentration Theorem implies the following spectacular result.

Assume that the random vectors X_1, ..., X_n are Gaussian and centered at their expectation. Let v be the maximal variance of X_{1,t} + ··· + X_{n,t} when t varies, and use M to denote either the median or the mean of Z. Then
\[
\Pr(Z \ge M + z) \le \exp\Big(-\frac{z^2}{2v}\Big).
\]

In the non-Gaussian case, the problem becomes much more complex. One of Talagrand's major achievements on the topic of concentration inequalities for functions on a product space 𝒳 = 𝒳_1 × ··· × 𝒳_n is his celebrated convex distance inequality. Given any vector α = (α_1, ..., α_n) of non-negative real numbers and any (x, y) ∈ 𝒳 × 𝒳, the weighted Hamming distance d_α is defined by
\[
d_\alpha(x, y) = \sum_{i=1}^{n} \alpha_i\, 1(x_i \ne y_i),
\]
where 1(A) denotes the indicator of the set A. Talagrand's convex distance from a point x to some measurable subset A of 𝒳 is then defined by
\[
d_T(x, A) = \sup_{|\alpha|_2^2 \le 1} d_\alpha(x, A),
\]
where |α|₂² = α_1² + ··· + α_n². If P denotes some product probability measure P = µ_1 ⊗ ··· ⊗ µ_n on 𝒳, the concentration of P with respect to d_T is specified by Talagrand's convex distance inequality, which ensures that for any measurable set A, one has
\[
P\{d_T(\cdot, A) \ge z\} \le \frac{1}{P(A)}\,\exp\Big(-\frac{z^2}{4}\Big). \tag{28.4}
\]
Typically, it allows the analysis of functions that satisfy the regularity condition
\[
f(x) - f(y) \le \sum_{i=1}^{n} \alpha_i(x)\,1(x_i \ne y_i). \tag{28.5}
\]
One can then play the following simple but subtle game. Choose A = {f ≤ M} and observe that, in view of condition (28.5), one has f(x) ≤ M + √v · d_T(x, A) for every x, where v = sup_x ∑_{i=1}^{n} α_i²(x).


Hence, by Talagrand's convex distance inequality (28.4), one gets
\[
P(f \ge M + z) \le 2\exp\Big(-\frac{z^2}{4v}\Big) \tag{28.6}
\]
whenever M is a median of f under P. The preceding inequality applies to a Rademacher process, which is a special case of an empirical process. Indeed, setting 𝒳 = {−1, 1}^n and defining
\[
f(x) = \sup_{t\in T} \sum_{i=1}^{n} \alpha_{i,t}\, x_i = \sum_{i=1}^{n} \alpha_{i,t^*(x)}\, x_i
\]
in terms of real numbers α_{i,t}, one can see that, for every x and y,
\[
f(x) - f(y) \le \sum_{i=1}^{n} \alpha_{i,t^*(x)}\,(x_i - y_i) \le 2\sum_{i=1}^{n} \big|\alpha_{i,t^*(x)}\big|\,1(x_i \ne y_i).
\]
This means that the function f satisfies the regularity condition (28.5) with α_i(x) = 2|α_{i,t^*(x)}|. Thus if X = (X_1, ..., X_n) is uniformly distributed on the hypercube {−1, 1}^n, it follows from (28.6) that the supremum of the Rademacher process
\[
Z = \sup_{t\in T}\sum_{i=1}^{n} \alpha_{i,t}\, X_i = f(X)
\]
satisfies the sub-Gaussian tail inequality
\[
\Pr(Z \ge M + z) \le 2\exp\Big(-\frac{z^2}{4v}\Big),
\]
where the variance factor v can be taken as v = 4 sup_{t∈T} (α_{1,t}² + ··· + α_{n,t}²).

This illustrates the power of Talagrand's convex distance inequality. Alas, while condition (28.5) is perfectly suited to the analysis of Rademacher processes, it does not carry over to more general empirical processes.

At first, I found it a bit frustrating that there was no analogue of the Gaussian concentration inequality for more general empirical processes and that Talagrand's beautiful results were seemingly of no use for dealing with suprema of empirical processes like (28.2). Upon reading Talagrand (1994) carefully, however, I realized that it contained at least one encouraging result. Namely, Talagrand (1994) proved a sub-Gaussian Bernstein type inequality for Z − C E(Z), where C is a universal constant. Of course, in Talagrand's version, C is not necessarily equal to 1, but it was reasonable to expect that this should be the case. This is exactly what Lucien Birgé and I were able to show. We presented our result at the 1994 workshop organized at Yale in honor of Lucien Le Cam. A year or so later, I was pleased to hear from Michel Talagrand that, motivated in part by the statistical issues described above, and at the price of some substantial deepening of his approach to concentration of product measures, he could solve the problem and obtain his now famous concentration inequality for the supremum of a bounded empirical process; see Talagrand (1996).
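A quick Monte Carlo sanity check of this sub-Gaussian tail for a small Rademacher process; the coefficient array is arbitrary, and the comparison only illustrates that the empirical tail sits below 2 exp{−z²/(4v)} with v = 4 sup_t ∑_i α²_{i,t} (the bound is of course far from tight).

```python
import numpy as np

rng = np.random.default_rng(42)
n, T = 20, 12
alpha = rng.uniform(-1, 1, size=(T, n))
alpha *= 0.5 / np.linalg.norm(alpha, axis=1, keepdims=True)  # each row has l2 norm 1/2
v = 4 * np.max(np.sum(alpha ** 2, axis=1))                   # variance factor; here v = 1

def Z(x):
    """Supremum over t of the Rademacher process sum_i alpha_{i,t} x_i."""
    return np.max(alpha @ x)

samples = np.array([Z(rng.choice([-1, 1], size=n)) for _ in range(20000)])
M = np.median(samples)

for z in (1.0, 2.0, 3.0):
    print(f"z = {z}: empirical tail {np.mean(samples >= M + z):.4f} "
          f"<= bound {2 * np.exp(-z**2 / (4 * v)):.4f}")
```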


28.4 Beyond Talagrand's inequality

Talagrand's new result for empirical processes stimulated intense research, part of which was aimed at deriving alternatives to Talagrand's original approach. The interested reader will find in Boucheron et al. (2013) an account of the transportation method and of the so-called entropy method that we developed in a series of papers (Boucheron et al., 2000, 2003, 2005; Massart, 2000) in the footsteps of Michel Ledoux (1996). In particular, using the entropy method established by Olivier Bousquet (2002), we derived a version of Talagrand's inequality for empirical processes with optimal numerical constants in the exponential bound.

Model selection issues are still posing interesting challenges for empirical process theory. In particular, the implementation of non-asymptotic penalization methods requires data-driven penalty choice strategies. One possibility is to use the concept of "minimal penalty" that Lucien Birgé and I introduced in Birgé and Massart (2007) in the context of Gaussian model selection and, more generally, the "slope heuristics" (Arlot and Massart, 2009), which basically relies on the idea that the empirical loss
\[
\gamma_n(s) - \gamma_n(\hat{s}_m) = \sup_{t\in S_m}\{\gamma_n(s) - \gamma_n(t)\}
\]
has a typical behavior for large-dimensional models. A complete theoretical validation of these heuristics is yet to be developed, but partial results are available; see, e.g., Arlot and Massart (2009), Birgé and Massart (2007), and Saumard (2013).

A fairly general concentration inequality providing a non-asymptotic analogue of Wilks' Theorem is also established in Boucheron and Massart (2011) and used in Arlot and Massart (2009). This result stems from the entropy method, which is flexible enough to capture the following rather subtle self-localization effect. The variance of sup_{t∈S_m}{γ_n(s) − γ_n(t)} can be proved to be of the order of the variance of γ_n(s) − γ_n(t) at t = ŝ_m, which may be much smaller than the maximal variance. The maximal variance is typically the quantity that would emerge from a direct application of Talagrand's inequality for empirical processes.

The issue of calibrating model selection criteria from data is of great importance. In the context where the list of models itself is data dependent (think, e.g., of models generated by variables selected by an algorithm such as LARS), the problem is related to the equally important issue of choosing regularization parameters; see Meynet (2012) for more details. This is a new field of investigation which is interesting both from a theoretical and a practical point of view.


References

Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In Proceedings of the Second International Symposium on Information Theory. Akadémiai Kiadó, Budapest, pp. 267–281.

Arlot, S. and Massart, P. (2009). Data-driven calibration of penalties for least squares regression. Journal of Machine Learning Research, 10:245–279.

Bahadur, R.R. (1958). Examples of inconsistency of maximum likelihood estimates. Sankhyā, Series A, 20:207–210.

Barron, A.R., Birgé, L., and Massart, P. (1999). Risk bounds for model selection via penalization. Probability Theory and Related Fields, 113:301–413.

Barron, A.R. and Cover, T.M. (1991). Minimum complexity density estimation. IEEE Transactions on Information Theory, 37:1034–1054.

Birgé, L. and Massart, P. (1993). Rates of convergence for minimum contrast estimators. Probability Theory and Related Fields, 97:113–150.

Birgé, L. and Massart, P. (1997). From model selection to adaptive estimation. In Festschrift for Lucien Le Cam: Research Papers in Probability and Statistics (D. Pollard, E. Torgersen, and G. Yang, Eds.). Springer, New York, pp. 55–87.

Birgé, L. and Massart, P. (2007). Minimal penalties for Gaussian model selection. Probability Theory and Related Fields, 138:33–73.

Borell, C. (1975). The Brunn–Minkowski inequality in Gauss space. Inventiones Mathematicae, 30:207–216.

Boucheron, S., Bousquet, O., Lugosi, G., and Massart, P. (2005). Moment inequalities for functions of independent random variables. The Annals of Probability, 33:514–560.

Boucheron, S., Lugosi, G., and Massart, P. (2000). A sharp concentration inequality with applications. Random Structures and Algorithms, 16:277–292.

Boucheron, S., Lugosi, G., and Massart, P. (2003). Concentration inequalities using the entropy method. The Annals of Probability, 31:1583–1614.

Boucheron, S., Lugosi, G., and Massart, P. (2013). Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, Oxford, UK.


Boucheron, S. and Massart, P. (2011). A high-dimensional Wilks phenomenon. Probability Theory and Related Fields, 150:405–433.

Bousquet, O. (2002). A Bennett concentration inequality and its application to suprema of empirical processes. Comptes rendus mathématiques de l'Académie des sciences de Paris, 334:495–500.

Cirel'son, B.S., Ibragimov, I.A., and Sudakov, V.N. (1976). Norm of Gaussian sample function. In Proceedings of the 3rd Japan–USSR Symposium on Probability Theory, Lecture Notes in Mathematics 550. Springer, Berlin, pp. 20–41.

Cirel'son, B.S. and Sudakov, V.N. (1974). Extremal properties of half spaces for spherically invariant measures. Journal of Soviet Mathematics, 9:9–18 (1978) [Translated from Zapiski Nauchnykh Seminarov Leningradskogo Otdeleniya Matematicheskogo Instituta im. V.A. Steklova AN SSSR, 41:14–24 (1974)].

Daniel, C. and Wood, F.S. (1971). Fitting Equations to Data. Wiley, New York.

Donoho, D.L. and Johnstone, I.M. (1994). Ideal spatial adaptation by wavelet shrinkage. Biometrika, 81:425–455.

Donoho, D.L., Johnstone, I.M., Kerkyacharian, G., and Picard, D. (1995). Wavelet shrinkage: Asymptopia? (with discussion). Journal of the Royal Statistical Society, Series B, 57:301–369.

Ledoux, M. (1996). On Talagrand deviation inequalities for product measures. ESAIM: Probability and Statistics, 1:63–87.

Massart, P. (2000). About the constants in Talagrand's concentration inequalities for empirical processes. The Annals of Probability, 28:863–884.

Massart, P. (2007). Concentration Inequalities and Model Selection. École d'été de probabilités de Saint-Flour 2003. Lecture Notes in Mathematics 1896. Springer, Berlin.

Meynet, C. (2012). Sélection de variables pour la classification non supervisée en grande dimension. Doctoral dissertation, Université Paris-Sud XI, Orsay, France.

Saumard, A. (2013). Optimal model selection in heteroscedastic regression using piecewise polynomial functions. Electronic Journal of Statistics, 7:1184–1223.

Talagrand, M. (1994). Sharper bounds for Gaussian and empirical processes. The Annals of Probability, 22:28–76.


P. Massart 321Talagrand, M. (1995). Concentration of measure and isoperimetric inequalitiesin product spaces. Publications mathématiques de l’Institut des hautesétudes supérieures, 81:73–205.Talagrand, M. (1996). New concentration inequalities in product spaces. InventionesMathematicae, 126:505–563.Vapnik, V.N. (1982). Estimation of Dependencies Based on Empirical Data.Springer, New York.


29
The past's future is now: What will the present's future bring?

Lynne Billard
Department of Statistics
University of Georgia, Athens, GA

Articles published in the early years of the Journal of the American Statistical Association, i.e., 1888–1910s, posited new theories mostly by using arithmetical arguments. Starting around the mid-1910s the arguments became algebraic in nature, and by the 1920s this trend was well established. Today, a century later, in addition to cogent mathematical arguments, new statistical developments are becoming computational, such is the power and influence of the modern computer (a device un-dreamed of in those earlier days). Likewise, we see enormous changes in the size and nature of assembled data sets for our study. Therefore, entirely new paradigms are entering our discipline, radically changing the way we go about our art. This chapter focuses on one such method wherein the data are symbolically valued, i.e., hypercubes in p-dimensional space R^p, instead of the classically valued points in R^p.

29.1 Introduction

The advent of modern computer capabilities has as a consequence that entirely new paradigms are entering our discipline, radically changing the way we go about our art. One hundred years ago, researchers were transitioning from using arithmetical arguments when developing their new mathematically-based ideas to using algebraic arguments (i.e., mathematical tools, algebra, calculus, and the like). Today's transition lies more along the lines of computational mathematical/statistical developments as we struggle with the massively huge data sets at hand. This chapter focuses on one such method — symbolic data — projected by Goodman (2011) as one of the two most important new developments on the horizon, wherein the data are symbolically valued, i.e., hypercubes in p-dimensional space R^p, instead of points in R^p as for classical data.


In Section 29.2, we describe briefly what symbolic data are and how they might arise. Then, in Section 29.3, we illustrate some symbolic methodological analyses and compare the results with those obtained when using classical surrogates. Some concluding remarks about the future of such data are presented in Section 29.4.

29.2 Symbolic data

Symbolic data consist of lists, intervals, histograms and the like, and arise in two broadly defined ways. One avenue is when data sets of classical point observations are aggregated into smaller data sets. For example, consider a large medical data set of millions of individual observations with information such as demographic (e.g., age, gender, etc.), geographical (e.g., town of residence, country, region, ...), basic medical diagnostics (pulse rate, blood pressure, weight, height, previous maladies and when, etc.), current ailments (e.g., cancer type such as liver, bone, etc.; heart condition, etc.), and so on. It is unlikely the medical insurer (or medical researcher, or ...) is interested in the details of your specific visit to a care provider on a particular occasion; indeed, the insurer may not even be interested in your aggregated visits over a given period of time. Rather, interest may focus on all individuals (and their accumulated history) who have a particular condition (such as heart valve failure), or maybe interest centers on the collection of individuals of a particular gender-age group with that condition. Thus, values are aggregated across all individuals in the specific categories of interest. It is extremely unlikely that all such individuals will have the same pulse rate, the same weight, and so forth. Instead, the aggregated values can take values across an interval, as a histogram, as a list of possible values, etc. That is, the data set now consists of so-called symbolic data.

Automobile insurers may be interested in accident rates of categories such as 26-year-old male drivers of red convertibles, and so on. Census data are frequently in the form of symbolic data; e.g., housing characteristics for regions may be described as {owner occupied, .60; renter occupied, .35; vacant, .05}, where 60% of the homes are owner occupied, etc.

There are countless examples. The prevailing thread is that large data sets of single classical observations are aggregated in some way, with the result that symbolic data perforce emerge. There are a myriad of ways these original data sets can be aggregated, with the actual form being driven by the scientific question(s) of interest.

On the other hand, some data are naturally symbolic in nature. For example, species are typically described by symbolic values; e.g., the mushroom species bernardi has a pileus cap width of [6, 7] cm. However, the particular mushroom in your hand may have a cap width of 6.2 cm, say. Pulse rates bounce around, so that an apparent rate of 64 (say) may really be 64 ± 2, i.e., the interval [62, 66]. There are numerous examples.


Detailed descriptions and examples can be found in Bock and Diday (2000) and Billard and Diday (2006). A recent review of current methodologies is available in Noirhomme-Fraiture and Brito (2011), with a non-technical introduction in Billard (2011). The original concept of symbolic data was introduced in Diday (1987). Note that symbolic data are not the same as fuzzy data; however, while they are generally different from the coarsening and grouping concepts of, e.g., Heitjan and Rubin (1991), there are some similarities.

The major issue then is how do we analyse these intervals (or histograms, ...)? Taking classical surrogates, such as the sample mean of aggregated values for each category and variable, results in a loss of information. For example, the intervals [10, 20] and [14, 16] both have the same midpoint; yet they are clearly differently valued observations. Therefore, it is important that analytical methods be developed to analyse symbolic data directly so as to capture these differences. There are other underlying issues that pertain, such as the need to develop associated logical dependency rules to maintain the integrity of the overall data structure; we will not consider this aspect herein, however.

29.3 Illustrations

Example 29.1 Table 29.1 displays (in two parts) a data set of histogram-valued observations, extracted from Falduti and Taibaly (2004), obtained by aggregating by airline approximately 50,000 classical observations for flights arriving at and departing from a major airport. For illustrative simplicity, we take three random variables Y1 = flight time, Y2 = arrival-delay-time, and Y3 = departure-delay-time for 10 airlines only. Thus, for airline u = 1,...,10 and variable j = 1, 2, 3, we denote the histogram-valued observation by Y_uj = {[a_ujk, b_ujk), p_ujk : k = 1,...,s_uj}, where the histogram sub-interval [a_ujk, b_ujk) has relative frequency p_ujk with Σ_k p_ujk = 1. The number of sub-intervals s_uj can vary across observations (u) and across variables (j).

Figure 29.1 shows the tree that results when clustering the data by a Ward's method agglomerative hierarchy algorithm applied to these histogram data when the Euclidean extended Ichino–Yaguchi distance measure is used; see Ichino and Yaguchi (1994) and Kim and Billard (2011, 2013) for details. Since there are too many classical observations to be able to build an equivalent tree on the original observations themselves, we resort to using classical surrogates. In particular, we calculate the sample means for each variable and airline. The resulting Ward's method agglomerative tree using Euclidean distances between the means is shown in Figure 29.2.
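The aggregation step that turns classical records into histogram-valued observations is easy to sketch. The Python fragment below is only an illustration of the idea, not the procedure used by Falduti and Taibaly (2004); the column names and the simulated flight times are hypothetical, and only the bin edges echo Table 29.1.

import numpy as np
import pandas as pd

# Hypothetical classical data: one record per flight (names and values illustrative only).
rng = np.random.default_rng(1)
flights = pd.DataFrame({
    "airline": rng.integers(1, 11, size=50_000),
    "flight_time": rng.uniform(0, 540, size=50_000),
})

# Bin edges playing the role of the sub-intervals [a_ujk, b_ujk) of Table 29.1.
edges = [0, 70, 110, 150, 190, 230, 270, 310, 350, 390, 430, 470, 540]

def to_histogram(values, edges):
    # Aggregate classical values into a histogram-valued observation:
    # a list of (sub-interval, relative frequency) pairs whose frequencies sum to 1.
    counts, _ = np.histogram(values, bins=edges)
    p = counts / counts.sum()
    return [((a, b), float(pk)) for a, b, pk in zip(edges[:-1], edges[1:], p)]

# One symbolic observation Y_u1 per airline u, analogous to a row of Table 29.1.
symbolic = flights.groupby("airline")["flight_time"].apply(lambda v: to_histogram(v, edges))
print(symbolic.loc[1])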


TABLE 29.1
Airline data.

Y1    [0, 70)   [70, 110)  [110, 150)  [150, 190)  [190, 230)  [230, 270)
 1    .00017    .10568     .33511      .20430      .12823      .045267
 2    .13464    .10799     .01823      .37728      .35063      .01122
 3    .70026    .22415     .07264      .00229      .00065      .00000
 4    .26064    .21519     .34916      .06427      .02798      .01848
 5    .17867    .41499     .40634      .00000      .00000      .00000
 6    .28907    .41882     .28452      .00683      .00076      .00000
 7    .00000    .00000     .00000      .00000      .03811      .30793
 8    .39219    .31956     .19201      .09442      .00182      .00000
 9    .00000    .61672     .36585      .00348      .00174      .00000
10    .76391    .20936     .01719      .00645      .00263      .00048

Y2    [−40, −20)  [−20, 0)  [0, 20)   [20, 40)  [40, 60)  [60, 80)
 1    .09260      .38520    .28589    .09725    .04854    .03046
 2    .09537      .45863    .30014    .07433    .03226    .01683
 3    .12958      .41361    .21008    .09097    .04450    .02716
 4    .06054      .44362    .33475    .08648    .03510    .01865
 5    .08934      .44957    .29683    .07493    .01729    .03746
 6    .07967      .36646    .28376    .10698    .06070    .03794
 7    .14024      .30030    .29573    .18293    .03659    .01067
 8    .03949      .40899    .33727    .12483    .04585    .02224
 9    .07840      .44599    .21603    .10627    .04530    .03310
10    .10551      .55693    .22989    .06493    .02363    .01074

Y3    [−15, 5)  [5, 25)   [25, 45)  [45, 65)  [65, 85)  [85, 105)
 1    .67762    .16988    .05714    .03219    .01893    .01463
 2    .84993    .07293    .03086    .01964    .01683    .00421
 3    .65249    .14071    .06872    .04025    .02749    .01669
 4    .77650    .14516    .04036    .01611    .01051    .00526
 5    .63112    .24784    .04323    .02017    .02882    .00288
 6    .70030    .12064    .06297    .04628    .02049    .01290
 7    .73323    .16463    .04726    .01677    .01220    .00305
 8    .78711    .12165    .05311    .01816    .00772    .00635
 9    .71080    .12369    .05749    .03310    .01916    .00523
10    .83600    .10862    .03032    .01408    .00573    .00286

TABLE 29.1
Airline data (continued).

Y1    [270, 310)  [310, 350)  [350, 390)  [390, 430)  [430, 470)  [470, 540]
 1    .07831      .07556      .02685      .00034      .00000      .00000
 2    .00000      .00000      .00000      .00000      .00000      .00000
 3    .00000      .00000      .00000      .00000      .00000      .00000
 4    .03425      .02272      .00729      .00000      .00000      .00000
 5    .00000      .00000      .00000      .00000      .00000      .00000
 6    .00000      .00000      .00000      .00000      .00000      .00000
 7    .34299      .21494      .08384      .01220      .00000      .00000
 8    .00000      .00000      .00000      .00000      .00000      .00000
 9    .00523      .00174      .00348      .00000      .00000      .00174
10    .00000      .00000      .00000      .00000      .00000      .00000

Y2    [80, 100)  [100, 120)  [120, 140)  [140, 160)  [160, 200)  [200, 240]
 1    .01773     .01411      .00637      .00654      .01532      .00000
 2    .01403     .00281      .00281      .00000      .00281      .00000
 3    .02094     .01440      .01276      .00884      .02716      .00000
 4    .00797     .00661      .00356      .00051      .00220      .00000
 5    .00865     .00576      .00576      .00576      .00865      .00000
 6    .02883     .00835      .01366      .00835      .00531      .00000
 7    .00762     .00305      .00152      .00762      .01372      .00000
 8    .00817     .00635      .00227      .00136      .00318      .00000
 9    .01916     .01394      .00871      .01220      .02091      .00000
10    .00286     .00143      .00167      .00095      .00143      .00000

Y3    [105, 125)  [125, 145)  [145, 165)  [165, 185)  [185, 225)  [225, 265]
 1    .00878      .00000      .00361      .00947      .00775      .00000
 2    .00281      .00000      .00000      .00140      .00140      .00000
 3    .01407      .00000      .01014      .01407      .01538      .00000
 4    .00305      .00000      .00085      .00068      .00153      .00000
 5    .00865      .00000      .00865      .00000      .00865      .00000
 6    .01897      .00000      .00986      .00607      .00152      .00000
 7    .00457      .00000      .00152      .00762      .00915      .00000
 8    .00227      .00000      .00136      .00045      .00182      .00000
 9    .01045      .00000      .01742      .01394      .00871      .00000
10    .00095      .00000      .00072      .00048      .00024      .00000

It is immediately apparent that the trees differ, even though both have the same "determinant" — agglomerative, Ward's method, and Euclidean distances. However, one tree is based on the means only while the other is based on the histograms; i.e., the histogram tree of Figure 29.1, in addition to the information in the means, also uses information in the internal variances of the observed values. Although the details are omitted, it is easy to show that, e.g., airlines (1, 2, 4) have similar means and similar variances overall; however, by omitting the information contained in the variances (as in Figure 29.2), while airlines (1, 2) have comparable means, they differ from those for airline 4. That is, the classical surrogate analysis is based on the means only.

A polythetic divisive tree built on the Euclidean extended Ichino–Yaguchi distances for the histograms is shown in Figure 29.3; see Kim and Billard (2011) for this algorithm. The corresponding monothetic divisive tree is comparable. This tree is different again from those of Figures 29.1 and 29.2; these differences reflect the fact that different clustering algorithms, along with different distance matrices and different methods, can construct quite different trees.
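A minimal sketch of the clustering step follows. In place of the Euclidean extended Ichino–Yaguchi distance of Kim and Billard (2011, 2013), it uses a plain Euclidean distance between the vectors of relative frequencies, so it should not be expected to reproduce Figure 29.1; it only shows how a distance matrix computed on symbolic observations feeds into a standard agglomerative algorithm, and the frequency matrix used here is a random stand-in for Table 29.1.

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

# Rows: 10 airlines; columns: the relative frequencies p_u1k of one histogram variable.
# A random stand-in is used; the frequencies of Table 29.1 could be pasted in instead.
P = np.random.default_rng(0).dirichlet(np.ones(12), size=10)

# Simplified histogram-to-histogram distance: Euclidean distance between frequency vectors.
D = np.sqrt(((P[:, None, :] - P[None, :, :]) ** 2).sum(axis=2))

# Ward's agglomerative algorithm on the condensed distance matrix.
tree = linkage(squareform(D, checks=False), method="ward")
leaves = dendrogram(tree, labels=[f"airline {u}" for u in range(1, 11)], no_plot=True)
print(leaves["ivl"])   # leaf order of the resulting tree

Replacing this simplified distance with one that also uses the sub-interval endpoints is what lets the symbolic analysis see the internal variances that a means-only surrogate analysis ignores.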


FIGURE 29.1: Ward's method agglomerative tree based on histograms.

FIGURE 29.2: Ward's method agglomerative tree based on means.

Example 29.2 Figure 29.4 displays simulated individual classical observations (Y1, Y2) drawn from bivariate normal distributions N2(µ, Σ). There are five samples, each with n = 100 observations. Sample S = 1 has mean µ = (5, 0), standard deviations σ1 = σ2 = .25 and correlation coefficient ρ = 0; samples S = 2, 3 have µ = (1, 1), σ1 = σ2 = .25 and ρ = 0; and samples S = 4, 5 have µ = (1, 1), σ1 = σ2 = 1 and ρ = .8. Each of the samples can be aggregated to produce a bivariate histogram observation Y_s, s = 1,...,5. When a divisive algorithm for histogram data is applied to these data, three clusters emerge containing the observations C1 = {Y1}, C2 = {Y2, Y3}, and C3 = {Y4, Y5}, respectively.
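A sketch of a simulation matching the description of Example 29.2 (my reconstruction in Python, not the original code) is:

import numpy as np

rng = np.random.default_rng(2023)

def sample(mu, sd, rho, n=100):
    # Bivariate normal sample with common standard deviation sd and correlation rho.
    cov = [[sd**2, rho * sd * sd], [rho * sd * sd, sd**2]]
    return rng.multivariate_normal(mu, cov, size=n)

samples = [
    sample((5, 0), 0.25, 0.0),   # S = 1
    sample((1, 1), 0.25, 0.0),   # S = 2
    sample((1, 1), 0.25, 0.0),   # S = 3
    sample((1, 1), 1.0, 0.8),    # S = 4
    sample((1, 1), 1.0, 0.8),    # S = 5
]

# Aggregate each sample into a bivariate histogram observation Y_s.
hists = [np.histogram2d(s[:, 0], s[:, 1], bins=8)[0] / len(s) for s in samples]

# Classical surrogates (the sample means) see only two distinct locations,
# which is why a clustering of the means cannot separate S = 2, 3 from S = 4, 5.
means = np.array([s.mean(axis=0) for s in samples])
print(np.round(means, 2))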


FIGURE 29.3: Polythetic divisive tree based on histograms.

FIGURE 29.4: Simulated data — how many clusters?


In contrast, applying algorithms, e.g., a K-means method, to the classical observations (or to classical surrogates such as the means) identifies only two clusters, viz., C1 = {Y1} and C2 = {Y2, Y3, Y4, Y5}. That is, since information such as the internal variations is not part of a classical analysis, the classical clustering analyses are unable to identify observations Y2 and Y3 as being different from observations Y4 and Y5.

Example 29.3 Consider the data of Table 29.2 where, for simplicity, we restrict attention to minimum and maximum monthly temperatures for four months only, January, April, July, and October (Y1–Y4, respectively), in 1988 for n = 10 weather stations. The interval values for station u = 1,...,10 and variable j = 1,...,4 are denoted by Y_uj = [a_uj, b_uj]. Elevation is also included as Y5; note that Y5 is a classical value and so is a special case of an interval value with a_u5 ≡ [a_u5, a_u5]. The data are extracted from http://dss.ucar.edu/datasets/ds578.5, which contains annual monthly weather values for several variables for many stations in China over many years.

TABLE 29.2
Temperature intervals and elevation.

Station   January          April            July             October          Elevation
   u      [a_u1, b_u1]     [a_u2, b_u2]     [a_u3, b_u3]     [a_u4, b_u4]     a_u5
   1      [−18.4, −7.5]    [−0.1, 13.2]     [17.0, 26.5]     [0.6, 13.1]      4.82
   2      [−20.0, −9.6]    [0.2, 11.9]      [17.8, 27.2]     [−0.2, 12.5]     3.44
   3      [−23.4, −15.5]   [−4.5, 9.5]      [12.9, 23.0]     [−4.0, 8.9]      14.78
   4      [−27.9, −16.0]   [−1.5, 12.0]     [16.1, 25.0]     [−2.6, 10.9]     4.84
   5      [−8.4, 9.0]      [1.7, 16.4]      [10.8, 23.2]     [1.4, 18.7]      73.16
   6      [2.3, 16.9]      [9.9, 24.3]      [17.4, 22.8]     [14.5, 23.5]     32.96
   7      [2.8, 16.6]      [10.4, 23.4]     [16.9, 24.4]     [12.4, 19.7]     37.82
   8      [10.0, 17.7]     [15.8, 23.9]     [24.2, 33.8]     [19.2, 27.6]     2.38
   9      [11.5, 17.7]     [17.8, 24.2]     [25.8, 33.5]     [20.3, 26.9]     1.44
  10      [11.8, 19.2]     [16.4, 22.7]     [25.6, 32.6]     [20.4, 27.3]     0.02

A principal component analysis on these interval-valued data produces the projections onto the PC1 × PC2 space shown in Figure 29.5. Details of the methodology and the visualization construction can be found in Le-Rademacher and Billard (2012), and further details of this particular data set are in Billard and Le-Rademacher (2012).
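The interval-valued PCA itself is beyond a few lines, but a crude stand-in conveys the geometry: run a classical PCA on the interval midpoints and project the half-ranges onto the principal axes to obtain a rectangle for each observation in PC-space. This is only a sketch of the idea on toy data; the symbolic-covariance method of Le-Rademacher and Billard (2012) builds the internal variation into the covariance matrix itself rather than carrying it along afterwards.

import numpy as np

# Interval data [a_uj, b_uj]; random toy values stand in for Table 29.2.
rng = np.random.default_rng(7)
a = rng.normal(size=(10, 5))
b = a + rng.uniform(0.5, 3.0, size=(10, 5))   # upper endpoints

centers = (a + b) / 2        # interval midpoints
half_ranges = (b - a) / 2    # internal spread of each observation

# Classical PCA on the centered midpoints via the singular value decomposition.
X = centers - centers.mean(axis=0)
_, _, Vt = np.linalg.svd(X, full_matrices=False)
pc_centers = X @ Vt[:2].T                        # midpoint projections onto PC1, PC2
pc_half_widths = half_ranges @ np.abs(Vt[:2].T)  # extent of each hyperrectangle along PC1, PC2

# Observation u is then drawn as the rectangle
#   [pc_centers[u] - pc_half_widths[u], pc_centers[u] + pc_half_widths[u]]
# in the PC1 x PC2 plane; wider data intervals give larger rectangles, as in Figure 29.5.
print(np.round(pc_half_widths, 2))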


FIGURE 29.5: PCA based on intervals.

Notice in particular that since the original observations are hypercubes in R^p space, the corresponding principal components are hypercubes in PC-space. The relative sizes of these PC hypercubes reflect the relative sizes of the data hypercubes. For example, if we compare the observed values of stations u = 5 and u = 10 in Table 29.2, it is clear that the intervals across the variables for u = 5 are on balance wider than are those for u = 10; thus, the principal component hypercube is larger for u = 5 than for u = 10. That is, the observation u = 5 has a larger internal variation. These internal variations are a component of the covariance terms in the covariance (and correlation) matrix. This feature is not possible in a classical analysis, with the point observation in R^p being transformed into but a point value in PC-space, as shown in Figure 29.6 for the classical principal component analysis performed on the interval means. While both the symbolic and classical analyses showed the temperatures as being of comparable importance to PC1, with elevation being important only for PC2, the visualizations through the PC hypercubes of Figure 29.5 are more informative than are the PC points of Figure 29.6.

29.4 Conclusion

By the time that Eddy (1986) considered the future of computers in statistical research, it was already clear that a computer revolution was raising its head over the horizon. This revolution was not simply focussed on bigger and better computers to do traditional calculations on a larger scale, though that too was a component, then and now. Rather, more expansively, entirely new ways of approaching our art were to be the new currency of the looming 21st century. Early signs included the emergence of new methodologies such as the bootstrap (Efron, 1979) and the Gibbs sampler (Geman and Geman, 1984), though both owed their roots to earlier researchers. While clearly these and similar computational methodologies had not been feasible in earlier days, thereby being a product of computer advances, they are still classical approaches per se.


FIGURE 29.6: PCA based on means.

By the 1990s, COPSS Presidential Addresses referred to the upcoming information and technological revolution, its waves already heading for the unwary; see, e.g., Kettenring (1997) and Billard (1995, 1997).

However, the real advances will take quite different formats to those predicted in the 1990s. In a very prescient comment, Schweizer (1984) declared that "distributions are the numbers of the future." The present is that future. Furthermore, today's future consists of a new paradigm whereby new methodologies, and new theories to support those methodologies, must be developed if we are to remain viable players as data analysts. These new methods must also be such that the classical models of the still-present and past come out correctly as special cases of the new approaches.

In this chapter, one such approach, viz., symbolic data, has been described, albeit ever so briefly. While a study of the literature may at first suggest there are many symbolic techniques currently available, in reality there are very few, and even then those few handle relatively narrowly defined situations.

There are two major directions for future work: one is to develop the new methodologies for new data structures and to extend the plethora of situations that a century or more of research in so-called classical statistics produced, while the other is to establish mathematical underpinnings to support those new methods (somewhat akin to the theoretical foundations provided initially by Bickel and Freedman (1981) and Singh (1981), which validated the early bootstrap work). One thing is certain: the present-future demands that we engage our energies in addressing the myriad of issues surrounding large and complex data sets. It is an exciting time to be a part of this undertaking.


L. Billard 333ReferencesBickel, P.J. and Freedman, D.A. (1981). Some asymptotic theory for thebootstrap. The Annals of Statistics, 9:1196–1217.Billard, L. (1995). The roads travelled. 1995. Biometrics, 51:1–11.Billard, L. (1997). A voyage of discovery. Journal of American StatisticalAssociation, 92:1–12.Billard, L. (2011). Brief overview of symbolic data and analytic issues. StatisticalAnalysis and Data Mining, 4:149–156.Billard, L., and Diday, E. (2006). Symbolic Data Analysis: Conceptual Statisticsand Data Mining. Wiley,Chichester.Billard, L., and Le-Rademacher, J. (2012). Principal component analysis forinterval data. Wiley Interdisciplinary Reviews: Computational Statistics,4:535–540.Bock, H.-H. and Diday, E. (2000). Analysis of Symbolic Data: ExploratoryMethods for Extracting Statistical Information from Complex Data.Springer, Berlin.Diday, E. (1987). Introduction à l’approche symbolique en analyse desdonnées. Premières Journées Symbolique-Numérique, CEREMADE,UniversitéParis–Dauphine,Paris,France,pp.21–56.Eddy, W.F. (1986). Computers in statistical research. Statistical Science,1:419–437.Efron, B. (1979). Bootstrap methods: Another look at the jackknife. TheAnnals of Statistics, 7:1–26.Falduti, N. and Taibaly, H. (2004). Étude des retards sur les vols des compagniesaériennes. ReportCEREMADE,UniversitéParis–Dauphine,Paris,France.Geman, S. and Geman, D. (1984). Stochastic relaxation, Gibbs distributions,and the Bayesian restoration of images. IEEE Transactions on PatternAnalysis and Machine Intelligence, 6:721–744.Goodman, A. (2011). Emerging topics and challenges for statistical analysisand data mining. Statistical Analysis and Data Mining, 4:3–8.Heitjan, D.F. and Rubin, D.B. (1991). Ignorability and coarse data. TheAnnals of Statistics, 19:2244–2253.


Ichino, M. and Yaguchi, H. (1994). Generalized Minkowski metrics for mixed feature type data analysis. IEEE Transactions on Systems, Man and Cybernetics, 24:698–708.

Kettenring, J.R. (1997). Shaping statistics for success in the 21st century. Journal of the American Statistical Association, 92:1229–1234.

Kim, J. and Billard, L. (2011). A polythetic clustering process for symbolic observations and cluster validity indexes. Computational Statistics and Data Analysis, 55:2250–2262.

Kim, J. and Billard, L. (2013). Dissimilarity measures for histogram-valued observations. Communications in Statistics: Theory and Methods, 42:283–303.

Le-Rademacher, J. and Billard, L. (2012). Symbolic-covariance principal component analysis and visualization for interval-valued data. Journal of Computational and Graphical Statistics, 21:413–432.

Noirhomme-Fraiture, M. and Brito, M.P. (2011). Far beyond the classical data models: Symbolic data analysis. Statistical Analysis and Data Mining, 4:157–170.

Schweizer, B. (1984). Distributions are the numbers of the future. In Proceedings of the Mathematics of Fuzzy Systems Meeting (A. di Nola and A. Ventes, Eds.). Università di Napoli, Napoli, Italy, pp. 137–149.

Singh, K. (1981). On the asymptotic accuracy of Efron's bootstrap. The Annals of Statistics, 9:1187–1195.


30
Lessons in biostatistics

Norman E. Breslow
Department of Biostatistics
University of Washington, Seattle, WA

Today's medical journals are full of factual errors and false conclusions arising from lack of statistical common sense. Reflecting on personal experiences, I argue that statisticians can substantially improve medical science by informed application of standard statistical principles. Two specific areas are identified where lack of such input regularly produces faulty research. Statisticians are needed more than ever to bring rigor to clinical research.

30.1 Introduction

Biostatisticians develop and apply statistical concepts and methods to clinical medicine, to laboratory medicine and to population medicine or public health. During the fifty years since COPSS was organized, their work has become increasingly important. Major medical journals often insist on biostatistical review of submitted articles. Biostatistics graduates are in high demand for work in industry, government and academia. They occupy prominent positions as heads of corporations and universities, deans of schools of public health and directors of major research programs.

In spite of the heightened visibility of the profession, much of today's medical research is conducted without adequate biostatistical input. The result is not infrequently a waste of public resources, the promulgation of false conclusions and the exposure of patients to possible mistreatment. I describe a few of the more common episodes of flawed research with which I have come in contact, which involve "immortal time" in follow-up studies and lack of proper validation of discriminant rules. I discuss the lessons learned both from these episodes and more generally from my decades of work in childhood cancer. The primary focus of the chapter is on biostatistics in clinical medicine. Other chapters in this volume discuss the role of statistics in laboratory medicine, especially genetics, and in public health.


30.2 It's the science that counts

My introduction to biostatistics was in graduate school. During the school year a small group from the Stanford statistics department made the trek to the medical school for a weekly seminar. There we learned from medical faculty and our professors about the research problems on which they were collaborating. During the summer we took jobs with local research organizations. At weekly meetings back on campus, we presented the problems stemming from our work and got advice from each other and the professors on how to approach them.

One summer I worked at the state health department. There was considerable interest in the possibility of an infectious origin for leukemia and speculation that transmission of the putative infectious agent might occur between animals and humans. The health department was conducting a census of cancer occurrence in dogs and cats in Alameda county, and the epidemiologists wanted to evaluate possible space-time clustering of leukemia cases in people and in cats. The maps at their disposal, however, were inaccurate. Ascertainment of the geographic coordinates needed for quantitative analysis was subject to substantial error. My assignment was to read up on spatial statistical distributions and develop a measurement error model. I was having considerable difficulty.

I will never forget the stern advice I received from Professor Lincoln Moses following my presentation at the weekly meeting back on campus. "What you need is a good set of maps," he said. "Try the water company!" Obviously, in his mind, as later in mine, the best method of dealing with measurement error was to avoid it! Bradford Hill gave similar advice:

"One must go and seek more facts, paying less attention to the techniques of handling the data and far more to the development and perfection of the methods of obtaining them." (Hill, 1953)

As it turned out, the East Bay Municipal Utilities District (EBMUD) had just completed a very extensive and costly mapping program. The maps were so accurate that you had to decide where in the residence to plot the case to determine the coordinates. Executives in charge of the program were delighted to learn that their maps would serve not only corporate interests but also those of public health. Instead of working on a statistical methods problem, I spent my remaining time that summer on administrative issues related to the use of the maps by the health department. A photograph of me with health department and EBMUD officials poring over the maps was published in the corporate magazine and hangs in my office today. The lesson learned was invaluable.

I had a similar experience shortly after my arrival at the University of Washington in 1968. Having applied for a position in the Mathematics Department, not realizing it was in the process of downsizing and discharging most of its statisticians, I wound up as a biostatistician in the Medical School.


My support came mainly from service as statistician to the Children's Cancer Study Group. In those days the MDs who chaired the protocol study committees sometimes compiled the data themselves (one dedicated researcher meticulously arranged the flow sheets on her living room floor) and sent me simple data summaries with a request for calculation of some standard statistic. I was appalled by the routine exclusion from randomized treatment comparisons of patients who had "inadequate trials" of chemotherapy due to early discontinuation of the assigned treatment regimen or early death. It was clear that a more systematic approach to data collection and analysis was needed.

My colleague Dick Kronmal, fortunately, had just developed a computer system to store and summarize data from longitudinal studies that generated multiple records per patient (Kronmal et al., 1970). This system was perfect for the needs of the children's cancer group. It allowed me to quickly establish a Data and Statistical Center both for the group and for the National Wilms Tumor Study (NWTS), whose steering committee I joined as a founding member in 1969. (Wilms is an embryonal tumor of the kidney that occurs almost exclusively in children.) Once again the lesson learned was that "development and perfection of the methods of obtaining the data" were at least as important to the overall scientific enterprise as were the statistical methods I subsequently helped develop to "handle" right censored survival data. Having me, as statistician, take control of data collection and processing, while sharing responsibility for data quality with the clinicians, made it easier for me to then also exercise some degree of control over which patients were included in any given analysis.

My experience was not unusual. The role of biostatisticians in cooperative clinical research was rapidly evolving as the importance of their contributions became more widely appreciated. It soon became commonplace for them to occupy leadership positions within the cooperative group structure, for example, as heads of statistics departments or as directors of independently funded coordinating centers.

A steady diet of service to clinical trial groups, however, can with time become tedious. It also interferes with production of the first-authored papers needed for promotion in academia. One way to relieve the tedium, and to generate the publications, is to get more involved in the science. For example, the biostatistician can propose and conduct ancillary studies that utilize the valuable data collected through the clinical trials mechanism. The first childhood leukemia study in which I was involved was not atypical in demonstrating that treatment outcomes varied much more with baseline host and disease characteristics, in this case age and the peripheral white blood cell count, than with the treatments the study was designed to assess (Miller et al., 1974). This result was apparently a revelation to the clinicians. They jumped on it to propose treatment stratification based on prognostic factor groups in subsequent trials, so that the most toxic and experimental treatments were reserved for those who actually needed them. Subsequently, I initiated several studies of prognosis in Wilms tumor that resulted in greater appreciation for the adverse outcomes associated with tumor invasion of regional lymph nodes and ultimately to changes in the staging system.


Fascinated by how well Knudson's (Knudson, 1971) 2-hit mutational model explained the genetic epidemiology of retinoblastoma, another embryonal tumor of a paired organ (in this case the eye rather than the kidney), I conducted studies of the epidemiology of Wilms tumor that provided strong evidence for genetic heterogeneity, an explanation for its lower incidence and younger ages-at-onset in Asians and a hypothesis regarding which survivors were at especially high risk of end stage renal disease in young adulthood (Breslow and Beckwith, 1982; Breslow et al., 2006; Lange et al., 2011). Since 1991, I have served as Principal Investigator on the NIH grant that funds the continued follow-up of NWTS survivors for "late effects" associated with Wilms tumor and its treatment. This study has occupied an increasingly important place in my research repertoire.

30.3 Immortal time

In my opening lecture to a class designed primarily for second year doctoral students in epidemiology, I state the First Rule of survival analysis: Selection into the study cohort, or into subgroups to be compared in the analysis, must not depend on events that occur after the start of follow-up. While this point may be obvious to a statistician, certainly one trained to use martingale arguments to justify inferences about how past history influences rates of future events, it was not obvious to many of the epidemiologists. The "immortal time" bias that results from failure to follow the rule has resulted, and continues regularly to result, in grossly fraudulent claims in papers published in the most prestigious medical journals.

My first exposure to the issue came soon after I started work with the children's cancer group. The group chair was puzzled by a recently published article that called into question the standard criteria for evaluation of treatment response in acute leukemia. These included the stipulation that patients with a high bone marrow lymphocyte count (BMLC) be excluded from the excellent response category. Indeed, a high BMLC often presaged relapse, defined as 5% or higher blast cells in the marrow. The article in question, however, reported that patients whose lymphocytes remained below the threshold level of 20% of marrow cells throughout the period of remission tended to have shorter remissions than patients whose BMLC exceeded 20% on at least one occasion. Although I knew little about survival analysis, and had not yet articulated the First Rule, I was familiar with random variation and the tendency of maxima to increase with the length of the series. Intuition suggested that the article's conclusion, that there was "no justification for excluding a patient from complete remission status because of bone marrow lymphocytosis," was erroneous.


Accordingly, using new data from the children's cancer group, I attempted to convince my clinical colleagues that the reasoning was fallacious (Breslow and Zandstra, 1970). I first replicated the earlier findings by demonstrating that, when patients were classified into three categories according to the BMLC values observed during remission, the "remission duration" (progression-free survival) curve for the group with highest maximum BMLC was on top and that for the group with lowest maximum BMLC was on the bottom. When patients were classified according to the average of their BMLC values during remission, however, the ordering was reversed. Both comparisons were highly statistically significant. Of course, even the analysis based on average counts violated the First Rule. Nowadays one would employ time-dependent covariates or stratification to evaluate how the history of BMLC affected future relapse rates. The experience was a valuable lesson about the importance of "statistical thinking" in clinical research.

Many biostatisticians were sensitized to the issue of immortal time by Mitch Gail's critique of early claims of the efficacy of heart transplantation (Gail, 1972). To illustrate the problems with the statistical approach taken by cardiac surgeons in those days, he compared survival curves from time of admission as a transplant candidate according to whether or not the patient had subsequently received a transplant. He pointed out that patients who died early had less opportunity to receive a transplant, whereas those who did receive one were guaranteed, by definition, to have survived long enough for a suitable donor to be found. In effect, person-months of observation prior to transplant were unfairly subtracted from the total person-months for the control group, biasing their survival rate downwards, and added to the person-months for the transplant group, biasing their survival rate upwards. Correct accounting for the timing of transplant in the statistical comparison was subsequently undertaken by several statistical teams, for example, by use of time-dependent covariates in the Cox model (Crowley and Hu, 1977). When the data were properly analyzed, transplant as performed at the time was found to have little benefit.

Nick Day and I, in the section of our second IARC monograph (Breslow and Day, 1987) on allocation of person-time to time-dependent exposure categories, called attention to a fallacious claim of decreasing death rates with increasing duration of work in the polyvinyl-chloride industry. Here the investigators had contrasted standardized mortality ratios (of numbers of deaths observed to those expected from age-specific population rates) among workers employed for 0–14 versus 15+ years in the industry. Not only all deaths occurring beyond 15 years, however, but also all person-time accumulated by persons employed for 15+ years, had been allocated to the latter group. Day and I stated: "The correct assignment of each increment in person-years of follow-up is to that same exposure category to which a death would be assigned should it occur at that time." In other words, the first 15 years of employment time for the vinyl-chloride workers whose employment continued beyond that point should have been assigned to the 0–14 group. When this correction was made, the 15+ year exposure group had a slightly higher mortality ratio than did the 0–14 year group.
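The First Rule and the person-time allocation rule are easy to illustrate with a toy simulation. In the hedged Python sketch below (purely illustrative, unrelated to any of the studies just described), every subject has the same exponential survival distribution and "exposure" simply begins at a random later time; classifying subjects by whether they were ever exposed manufactures a large spurious survival advantage, while allocating each increment of person-time to the exposure category in force at that time recovers the common hazard.

import numpy as np

rng = np.random.default_rng(42)
n = 100_000
death = rng.exponential(scale=10.0, size=n)   # identical survival for everyone (hazard 0.1)
onset = rng.exponential(scale=5.0, size=n)    # time at which exposure would begin

ever_exposed = onset < death                  # knowable only at the end of follow-up!

# Naive comparison by eventual exposure status (violates the First Rule):
print(death[ever_exposed].mean(), death[~ever_exposed].mean())
# The 'exposed' group appears to live roughly four times longer, purely because
# surviving long enough is a precondition for becoming exposed (immortal time).

# Correct accounting: split each subject's person-time at the onset of exposure.
pt_unexposed = np.minimum(death, onset).sum()
pt_exposed = np.clip(death - onset, 0.0, None).sum()
rate_unexposed = (~ever_exposed).sum() / pt_unexposed
rate_exposed = ever_exposed.sum() / pt_exposed
print(rate_unexposed, rate_exposed)   # both close to the true hazard of 0.1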


Faculty at McGill University in Montréal, Canada, have repeatedly called attention to erroneous conclusions in the medical literature stemming from immortal time bias. One recent article takes issue with the finding that actors who won an Oscar lived on average nearly four years longer than those in a matched control group (Sylvestre et al., 2006). The authors pointed out that, as long ago as 1843, William Farr warned against the hazards of "classifying persons by their status at the end of follow-up and analyzing them as if they had been in these categories from the outset" (Farr, 1975). Farr continued:

"... certain professions, stations and ranks are only attained by persons advanced in years; and some occupations are followed only in youth; hence it requires no great amount of sagacity to perceive that 'the mean age at death' [...] cannot be depended on in investigating the influence of occupation, rank and profession upon health and longevity."

Noting the relatively early ages at death of Cornets, Curates and Juvenile Barristers, he concluded wryly: "It would be almost necessary to make them Generals, Bishops and Judges — for the sake of their health."

Mistakes are made even when investigators are seemingly aware of the problem. A 2004 report in The New England Journal of Medicine examined the effect on survival of a delay in kidney transplantation among children with end stage renal disease. The authors stated:

"Delay in kidney transplantation as a potential risk factor for early death was analyzed by comparing mortality among groups with different lengths of time until transplantation. To account for survival bias, delay as a predictor of death was analyzed beginning 2 years after the initiation of renal replacement therapy. There was no significant difference in mortality observed among groups with different lengths of time until transplantation (Fig 3)" (McDonald and Craig, 2004).

Close examination of their Figure 3, however, leads to a different conclusion. Survival curves from two years after onset of renal replacement therapy (dialysis or transplant) were shown separately for those with preemptive transplant (no delay), less than one-year delay and 1–2 years delay, categories based on information available at the start of follow-up at two years. They are in the anticipated order, with the survival outcomes best for those having had an immediate transplant followed in succession by those having had a 0–1 or 1–2 year delay. Had the authors simply added a fourth curve for those not yet transplanted by year 2, they would have found that it lay below the others. This would have confirmed the anticipated rank order in survival outcomes under the hypothesis that longer delay increased subsequent mortality. However, they mistakenly split the fourth group into those who never received a transplant and those who did so at some point after two years.


The survival curve for the "no transplant" group was far below all the others, with many deaths having occurred early on prior to a suitable donor becoming available, while the curve for the "≥ 2 years" group was second from highest due to immortal time. The clear message in the data was lost. I have used this widely cited paper as the basis for several exam and homework questions. Students often find the lessons about immortal time to be the most important they learned from the class.

I mentioned earlier my dissatisfaction with the exclusion of patients with "inadequate trials" from randomized treatment comparisons, a policy that was widely followed by the children's cancer group when I joined it. Such "per protocol" analyses constitute another common violation of the First Rule. Exclusion of patients based on events that occur after the start of follow-up, in particular the failure to receive protocol treatment, invites bias that is avoided by keeping all eligible patients in the study from the moment they are randomized. Analyses using all the eligible patients generate results that apply to a real population and that are readily compared with results from like studies. Attempts to clearly describe the fictitious populations to which the per protocol analyses apply are fraught with difficulty. My colleague Tom Fleming has thoughtfully discussed the fundamental principle that all patients be kept in the analysis following randomization, its rationale and its ramifications (Fleming, 2011).

30.4 Multiplicity

Whether from cowardice or good sense, I consciously strived throughout my career to avoid problems involving vast amounts of data collected on individual subjects. There seemed to be enough good clinical science to do with the limited number of treatment and prognostic variables we could afford to collect for the childhood cancer patients. The forays into the epidemiology of Wilms tumor similarly used limited amounts of information on gender, ages at onset, birth weights, histologic subtypes, precursor lesions, congenital malformations and the like. This allowed me to structure analyses using a small number of variables selected a priori to answer specific questions based on clearly stated hypotheses.

My successors do not have this luxury. Faced with the revolution in molecular biology, they must cope with increasingly high dimensional data in an attempt to assist clinicians deliver "personalized medicine" based on individual "omics" (genomics, epigenomics, proteomics, transcriptomics, metabolomics, etc.) profiles. I hope that widespread enthusiasm for the new technologies does not result in a tremendous expenditure of resources that does little to advance public health. This can be avoided if statisticians demand, and are given, a meaningful role in the process.


I am impressed by how eagerly my younger colleagues, as well as some of my peers, have responded to the challenge.

The problems of multiplicity were brought home to me in a forceful way when I read an article based on data from the 3rd and 4th NWTS trials supplied by our pathologist to a group of urologists and pathologists at the prestigious Brady Institute at Johns Hopkins Hospital (JHH); see Partin et al. (1994). Regrettably, they had not solicited my input. I was embarrassed that a publication based on NWTS data contained such blatant errors. For one thing, although our pathologist had supplied them with a case-control sample that was overweighted with children who had relapsed or had advanced disease at onset, they ignored the design and analysed the data as a simple random sample. Consequently their Kaplan–Meier estimates of progression-free survival were seriously in error, suggesting that nearly half the patients with "favorable histology" relapsed or died within five years of diagnosis, whereas the actual fraction who did so was about 11%.

A more grievous error, however, was using the same data both to construct and to validate a predictive model based on a new technology that produced moderately high dimensional quantitative data. Determined to improve on the subjectivity of the pathologist, the JHH team had developed a technique they called nuclear morphometry to quantify the malignancy grading of Wilms and other urologic tumors, including prostate. From the archived tumor slide submitted by our pathologist for each patient, they selected 150 blastemal nuclei for digitizing. The digitized images were then processed using a commercial software package known as Dyna CELL. This produced for each nucleus a series of 16 shape descriptors including, for example, area, perimeter, two measures of roundness and two of ellipticity. For each such measure 17 descriptive statistics were calculated from the distribution of 150 values: mean, variance, skewness, kurtosis, means of five highest and five lowest values, etc. This yielded 16 × 17 = 242 nuclear morphometric observations per patient. Among these, the skewness of the nuclear roundness factor (SNRF) and the average of the lowest five values for ellipticity as measured by the feret diameter (distance between two tangents on opposite sides of a planar figure) method (LEFD) were found to best separate cases from controls, each yielding p = .01 by univariate logistic regression. SNRF, LEFD and age, a variable I had previously identified as an important prognostic factor, were confirmed by stepwise regression analysis as the best three of the available univariate predictors. They were combined into a discriminant function that, needless to say, did separate the cases from the controls used in its development, although only with moderate success.
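The selection effect at the heart of this episode can be reproduced with pure noise. The sketch below (hypothetical data and a deliberately simple discriminant, not the JHH procedure) picks the two "best" of 242 random features on one sample and shows that their apparent discriminating ability evaporates on an independent validation sample, much as SNRF and LEFD did in Table 30.1.

import numpy as np

rng = np.random.default_rng(0)
n_train, n_valid, p = 108, 218, 242

def make_data(n):
    X = rng.normal(size=(n, p))        # 242 pure-noise "morphometric" features
    y = rng.integers(0, 2, size=n)     # case/control labels, independent of X
    return X, y

Xtr, ytr = make_data(n_train)
Xva, yva = make_data(n_valid)

# Univariate screening: keep the two features that best separate cases from controls.
sep = np.abs(Xtr[ytr == 1].mean(axis=0) - Xtr[ytr == 0].mean(axis=0)) / Xtr.std(axis=0)
best = np.argsort(sep)[-2:]

def accuracy(X, y):
    # Classify each subject to whichever training-sample class mean is closer
    # on the two selected features.
    mu1 = Xtr[ytr == 1][:, best].mean(axis=0)
    mu0 = Xtr[ytr == 0][:, best].mean(axis=0)
    pred = np.abs(X[:, best] - mu1).sum(axis=1) < np.abs(X[:, best] - mu0).sum(axis=1)
    return (pred == (y == 1)).mean()

print(accuracy(Xtr, ytr))   # optimistic: the same data chose the features
print(accuracy(Xva, yva))   # near 0.5 on the independent validation sample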


TABLE 30.1
Regression coefficients (± SEs) in multivariate nuclear morphometric discriminant functions fitted to three data sets†.

                     Case-Control Sample                          Prospective Sample
Risk Factor    NWTS + JHH (n = 108)*   NWTS Alone (n = 95)    NWTS (n = 218)
Age (yr)       .02                     .013 ± .008            .017 ± .005
SNRF           1.17                    1.23 ± .52             −.02 ± .26
LEFD           90.6                    121.6 ± 48.4           .05 ± 47.5

† From Breslow et al. (1999). Reproduced with permission. © 1999 American Society of Clinical Oncology. All rights reserved.
* From Partin et al. (1994).

I was convinced that most of this apparent success was due to the failure to account for the multiplicity of comparisons inherent in the selection of the best 2 out of 242 measurements for the discriminant function. With good cooperation from JHH, I designed a prospective study to validate the ability of their nuclear morphometric score to predict relapse in Wilms tumor (Breslow et al., 1999). I identified 218 NWTS-4 patients who had not been included in the case-control study, each of whom had an archived slide showing a diagnosis by our pathologist of a Wilms tumor having the same "favorable" histologic subtype as considered earlier. The slides were sent to the JHH investigators, who had no knowledge of the treatment outcomes, and were processed in the same manner as for the earlier case-control study. We then contrasted results obtained by re-analysis of data for the 95 NWTS patients in the case-control study, excluding 13 patients from JHH who also had figured in the earlier report, with those obtained by analysis of data for the 218 patients in the prospective study.

The results, reproduced in Table 30.1, were instructive. Regression coefficients obtained using a Cox regression model fitted to data for the 95 NWTS patients in the original study are shown in the third column. They were comparable to those reported by the JHH group based on logistic regression analysis of data for the 95 NWTS plus 13 JHH patients. These latter coefficients, shown in the second column of the table, were used to construct the nuclear morphometric score. Results obtained using Cox regression fitted to the 218 patients in the prospective study, of whom 21 had relapsed and one had died of toxicity, were very different. As I had anticipated, the only variable that was statistically significant was the known prognostic factor age. Coefficients for the two nuclear morphometric variables were near zero. When the original nuclear morphometric score was applied to the prospective data, using the same cutoff value as in the original report, the sensitivity was reduced from 75% to 71% and the specificity from 69% to 56%. Only the inclusion of age in the score gave it predictive value when applied to the new data.


No further attempts to utilize nuclear morphometry to predict outcomes in patients with Wilms tumor have been reported. Neither the original paper from JHH nor my attempt to correct its conclusions have received more than a handful of citations. Somewhat more interest was generated by use of the same technique to grade the malignancy of prostate cancer, for which the JHH investigators identified the variance of the nuclear roundness factor as the variable most predictive of disease progression and disease related death. While their initial studies on prostate cancer suffered from the same failure to separate test and validation data that compromised the Wilms tumor case-control study, variance of the nuclear roundness factor did apparently predict adverse outcomes in a later prospective study.

Today the public is anxiously awaiting the anticipated payoff from their investment in omics research so that optimal medical treatments may be selected based on each patient's genomic or epigenomic make-up. Problems of multiplicity inherent in nuclear morphometry pale in comparison to those posed by development of personalized medicine based on omics data. A recent report from the Institute of Medicine (IOM) highlights the important role that statisticians and statistical thinking will play in this development (IOM, 2012). This was commissioned following the exposure of serious flaws in studies at Duke University that had proposed tests based on gene expression (microarray) profiles to identify cancer patients who were sensitive or resistant to specific chemotherapeutic agents (Baggerly and Coombes, 2009). Sloppy data management led to major data errors including off-by-one errors in gene lists and reversal of some of the sensitive/resistant labels. The corrupted data, coupled with inadequate information regarding details of computational procedures, made it impossible for other researchers to replicate the published findings. Questions also were raised regarding the integrity of the validation process. Ultimately, dozens of papers were retracted from major journals, three clinical trials were suspended and an investigation was launched into financial and intellectual/professional conflicts of interest.

The IOM report recommendations are designed to prevent a recurrence of this saga. They emphasize the need for evaluation of a completely "locked down" computational procedure using, preferably, an independent validation sample. Three options are proposed for determining when a fully validated test procedure is ready for clinical trials that use the test to direct patient management. To ensure that personalized treatment decisions based on omics tests truly do advance the practice of medicine, I hope eventually to see randomized clinical trials where test-based patient management is compared directly with current standard care.


30.5 Conclusion

The past 50 years have witnessed many important developments in statistical theory and methodology, a few of which are mentioned in other chapters of this COPSS anniversary volume. I have focussed on the place of statistics in clinical medicine. While this sometimes requires the creation of new statistical methods, more often it entails the application of standard statistical principles and techniques. Major contributions are made simply by exercising the rigorous thinking that comes from training in mathematics and statistics. Having statisticians take primary responsibility for data collection and management often improves the quality and integrity of the entire scientific enterprise.

The common sense notion that definition of comparison groups in survival analyses should be based on information available at the beginning of follow-up, rather than at its end, has been around for over 150 years. When dealing with high-dimensional biomarkers, testing of a well defined discriminant rule on a completely new set of subjects is obviously the best way to evaluate its predictive capacity. Related cross-validation concepts and methods have been known for decades. As patient profiles become more complex, and biology more quantitative, biostatisticians will have an increasingly important role to play in advancing modern medicine.

References

Baggerly, K.A. and Coombes, K.R. (2009). Deriving chemosensitivity from cell lines: Forensic bioinformatics and reproducible research in high-throughput biology. The Annals of Applied Statistics, 3:1309–1334.

Breslow, N.E. and Beckwith, J.B. (1982). Epidemiological features of Wilms tumor — Results of the National Wilms Tumor Study. Journal of the National Cancer Institute, 68:429–436.

Breslow, N.E., Beckwith, J.B., Perlman, E.J., and Reeve, A.E. (2006). Age distributions, birth weights, nephrogenic rests, and heterogeneity in the pathogenesis of Wilms tumor. Pediatric Blood Cancer, 47:260–267.

Breslow, N.E. and Day, N.E. (1987). Statistical Methods in Cancer Research II: The Design and Analysis of Cohort Studies. IARC Scientific Publications. International Agency for Research on Cancer, Lyon, France.

Breslow, N.E., Partin, A.W., Lee, B.R., Guthrie, K.A., Beckwith, J.B., and Green, D.M. (1999). Nuclear morphometry and prognosis in favorable histology Wilms tumor: A prospective reevaluation. Journal of Clinical Oncology, 17:2123–2126.


Breslow, N.E. and Zandstra, R. (1970). A note on the relationship between bone marrow lymphocytosis and remission duration in acute leukemia. Blood, 36:246–249.

Crowley, J. and Hu, M. (1977). Covariance analysis of heart transplant survival data. Journal of the American Statistical Association, 72:27–36.

Farr, W. (1975). Vital Statistics: A Memorial Volume of Selections from the Writings of William Farr. Scarecrow Press, Metuchen, NJ.

Fleming, T.R. (2011). Addressing missing data in clinical trials. Annals of Internal Medicine, 154:113–117.

Gail, M.H. (1972). Does cardiac transplantation prolong life? A reassessment. Annals of Internal Medicine, 76:815–817.

Hill, A.B. (1953). Observation and experiment. New England Journal of Medicine, 248:995–1001.

Institute of Medicine (2012). Evolution of Translational Omics: Lessons Learned and the Path Forward. The National Academies Press, Washington, DC.

Knudson, A.G. Jr. (1971). Mutation and cancer: Statistical study of retinoblastoma. Proceedings of the National Academy of Sciences, 68:820–823.

Kronmal, R.A., Bender, L., and Mortense, J. (1970). A conversational statistical system for medical records. Journal of the Royal Statistical Society, Series C, 19:82–92.

Lange, J., Peterson, S.M., Takashima, J.R., Grigoriev, Y., Ritchey, M.L., Shamberger, R.C., Beckwith, J.B., Perlman, E., Green, D.M., and Breslow, N.E. (2011). Risk factors for end stage renal disease in non-WT1-syndromic Wilms tumor. Journal of Urology, 186:378–386.

McDonald, S.P. and Craig, J.C. (2004). Long-term survival of children with end-stage renal disease. New England Journal of Medicine, 350:2654–2662.

Miller, D.R., Sonley, M., Karon, M., Breslow, N.E., and Hammond, D. (1974). Additive therapy in maintenance of remission in acute lymphoblastic leukemia of childhood — Effect of initial leukocyte count. Cancer, 34:508–517.


Partin, A.W., Yoo, J.K., Crooks, D., Epstein, J.I., Beckwith, J.B., and Gearhart, J.P. (1994). Prediction of disease-free survival after therapy in Wilms tumor using nuclear morphometric techniques. Journal of Pediatric Surgery, 29:456–460.

Sylvestre, M.P., Huszti, E., and Hanley, J.A. (2006). Do Oscar winners live longer than less successful peers? A reanalysis of the evidence. Annals of Internal Medicine, 145:361–363.


31
A vignette of discovery

Nancy Flournoy
Department of Statistics
University of Missouri, Columbia, MO

This story illustrates the power of statistics as a learning tool. Through an interplay of exploration and carefully designed experiments, each with their specific findings, an important discovery is made. Set in the 1970s and 80s, procedures implemented with the best intentions were found to be deadly. Before these studies, only hepatitis was known to be transmitted through contaminated blood products. We discovered that the cytomegalovirus could be transferred through contaminated blood products and developed novel blood screening techniques to detect this virus just before it became well known for its lethality among persons with AIDS. We conclude with some comments regarding the design of experiments in clinical trials.

31.1 Introduction

Today blood banks have institutionalized sophisticated procedures for protecting the purity of blood products. The need for viral screening procedures is now taken for granted, but 40 years ago transmission of viral infections through the blood was understood only for hepatitis. Here I review the genesis of a hypothesis that highly lethal cytomegalovirus (CMV) pneumonia could result from contaminated blood products and the experiments that were conducted to test this hypothesis. The question of cytomegalovirus infection resulting from contaminated blood products arose in the early days of bone marrow transplantation. So I begin by describing this environment and how the question came to be asked.

E. Donnell Thomas began to transplant bone marrow into patients from donors who were not their identical twins in 1969. By 1975, his Seattle transplant team had transplanted 100 patients with acute leukemia (Thomas et al., 1977). Bone marrow transplantation is now a common treatment for childhood leukemia with a good success rate for young people with a well-matched donor.


Early attempts to transplant organs, such as kidneys, livers and hearts, were not very successful until it was determined that matching patient and donor at a few key genetic loci would substantially reduce the risk of rejection. Drugs to suppress the natural tendency of the patient's immune system to attack a foreign object further reduced the risk of rejection. In bone marrow transplantation, a good genetic match was also needed to prevent rejection. However, because a new and foreign immune system was being transplanted with the bone marrow, drugs were used not only to reduce the risk of rejection, but to keep the transplanted marrow from deciding that the whole patient was a foreign object and mounting an auto-immune-like attack called graft versus host disease (GVHD). Furthermore, in order both to destroy diseases of the blood and to prevent rejection of the new bone marrow, high doses of irradiation and drugs were given prior to the transplant. Eradicating as completely as possible all the patient's bone marrow destroys the bulk of the patient's existing immune system.

Since typically two to three weeks are required before the transplanted bone marrow's production of blood cells resumes, the Seattle team tried to anticipate problems that could result from a patient having an extended period of severely compromised blood production, and hence extremely poor immune function. It was well known that granulocytes (the white blood cells) fight infection. In order to protect the patient from bacterial infection, an elaborate and expensive system was devised to assure that the patient would be supported with plenty of donated granulocytes. When the transplant team moved into the newly built Fred Hutchinson Cancer Research Center in 1975, a large portion of one floor was dedicated to this task. On rows of beds, the bone marrow donors lay for hours each day with needles in both arms. Blood was taken from one arm, and passed through a machine that filtered off the granulocytes and returned the rest of the blood to the donor through the other arm.

Typically, the same person donated both the bone marrow and the granulocytes, with the hope that the genetic match would prevent the patient from becoming allergic to the granulocyte transfusions. The bone marrow donor was expected to stay in town for at least six weeks and lie quietly with needles in both arms every day so that the transplant patient could fight off threats of infection. Being a marrow donor required a huge time commitment.

31.2 CMV infection and clinical pneumonia

Early in the development of the bone marrow transplant procedure, it was clear that patients with sibling donors who were not identical twins were at high risk of death caused by CMV pneumonia (Neiman et al., 1977).


FIGURE 31.1
Incidence of CMV and all nonbacterial pneumonias expressed as percentage per patient-day for each week after allogeneic marrow transplant.

Figure 31.1 — displaying ten years of data from Meyers et al. (1982) — shows the incidence of pneumonia as a function of time following transplantation from genetically matched siblings who were not identical twins. The incidence distribution, expressed as the percentage per patient-day each week, is slightly skewed toward the time of transplant with a mode at about seven weeks. Of 525 patients, 215 (38%) had nonbacterial pneumonia, with CMV isolated in 40% of the cases and other viruses identified in 29% of the cases. Eighty-four percent of the 215 pneumonia cases were fatal. In contrast, CMV pneumonia was not a cause of death among those patients whose bone marrow donors were identical twins (Appelbaum et al., 1982). At the time, we erroneously speculated that this difference was due to the fact that patients without identical twin donors received more drugs to suppress the immune system than did patients with identical twin donors with no risk of graft rejection. However, the identical twins did not receive their transplant care at the Fred Hutchinson Cancer Research Center, but at University of Washington Hospital, where there was no provision for providing prophylactic granulocytes. We failed to recognize that the twins' reduced drug therapy was confounded with their not getting prophylactic granulocyte transfusions.

CMV is a common virus in the environment and by the time people reach adulthood, it can be found in about half the population. CMV does not cause problems in healthy individuals, but because it was diagnosed most frequently in pneumonia cases and these were the most fatal, the push was on to characterize the course of the illness, identify prognostic factors and find therapies.


In order to standardize diagnostic procedures so that the course of the risk period could be established, and to identify cases of viral infection early so that intervention trials might be feasible, we instituted a standard schedule of testing blood and urine samples for the presence of CMV virus and measuring anti-CMV antibody titers. CMV infection was defined to be present if the CMV virus was isolated in the routine blood and urine tests, or if it was found in the course of diagnosing pneumonia, or if antibody titers rose fourfold (seroconverted). Between October 1977 and August 1979, surveillance testing of blood and urine samples and antibody to CMV was measured in 158 patients and their donors prior to transplantation and periodically following transplant. The incidence of CMV infection in urine samples was approximately the same, regardless of the presence or absence of antibody to CMV before transplant, in either the donor or the recipient (Meyers et al., 1980). However, antibody titers increased (as measured by a summary statistic, the mean response stimulation index) after 41–60 days following transplant among patients who contracted CMV pneumonia (Figure 31.2). Note how the mode of the stimulation index among patients with CMV pneumonia coincides with the time of peak incidence shown in Figure 31.1. Among patients whose CMV titers were positive pretransplant (seropositive), average titers remained high (see Figure 31.3). But among patients whose pretransplant titers were negative (seronegative), the stimulation index remained low until about 60 days after transplant and then began to rise without regard to the marrow donor's pretransplant titer. So although we dismissed the idea that CMV infection was being transferred through the donor's bone marrow, this study suggested prognostic factors that might be manipulated in an intervention.

FIGURE 31.2
Mean response of lymphocytes to cytomegalovirus antigen. Numbers in parentheses represent the sample size in each group.


FIGURE 31.3
Mean response of lymphocytes to cytomegalovirus. Numbers in parentheses represent the sample size in each group.

Having designed and implemented a multidisciplinary research information system (Flournoy and Hearne, 1990) before the advent of commercial systems, we began mining the data for prognostic factors for CMV infection among those patients transplanted between 1979 and 1982 who had at least four surveillance cultures (Meyers et al., 1986). The surveillance data showed that just over half (51.5%) of the 545 recipients of marrow transplants without an identical twin donor became infected with CMV. CMV was cultured from 280 (51.4%) of the patients; 168 (30.8%) had at least a four-fold rise in titers (seroconverted). Much attention in this study focused on the relationship between the surveillance test results and the subsequent development of pneumonia. The relationship between surveillance results, pneumonia and GVHD, a complication of the transplant procedure, was also investigated. An association between GVHD and CMV clearly existed, suggesting that fatalities due to CMV would be reduced by eliminating GVHD. This was a false lead down another blind alley.

Among patients who had CMV in their blood prior to transplant, 69% subsequently became infected (i.e., they either seroconverted and/or began to excrete CMV in their urine). Among patients without CMV in their blood prior to transplant, 57% of those whose donors did and 28% of those whose donors did not have CMV in their blood subsequently became infected.


These observations suggested that patients having CMV in their blood prior to transplant were at high risk of infections; and that in patients without CMV in their blood prior to transplant, the donor might be passing the infection to the patient, either through the marrow transplant itself or through the blood transfusions given after transplant. This was our first clue that the granulocyte transfusions might be transmitting CMV; but it was impossible to fathom that our large effort dedicated to providing prophylactic granulocyte transfusions was so harmful. We believed that a randomized clinical trial would confirm that there was some unknown confounding variable that would explain away this association.

A proportional hazards regression analysis was performed separately for patients with and without CMV in their blood prior to transplant. Among seropositive patients, all the significant covariates were demographic variables, disease characteristics or treatment complications for which no known control was possible. Thus the models did not suggest possible interventions. However, among seronegative patients, the relative rate of CMV infection was 2.3 times greater (p = .0006) if the granulocyte transfusions were also found to be positive for CMV. This was the second clue.

31.3 Interventions

To run a clinical trial of subjects receiving and not receiving prophylactic granulocytes required developing a higher throughput procedure for identifying CMV in blood products. While exploratory discussions began with the King County Blood Bank about how to develop the needed screening procedures, we began an alternative clinical trial that did not require novel blood analytic methods be developed.

In light of the data mining results, we focused on the patients whose pretransplant titers were negative. Thinking that prophylactic anti-CMV immune globulin might prevent CMV infection from developing, eligible patients were randomized to receive globulin or nothing, with stratifications for the use of prophylactic granulocyte transfusions and the donor's titer to CMV. At the onset of the CMV immune globulin study, we took the association between CMV infection and granulocyte transfusions seriously enough to stratify for it, but not so seriously as to study it directly. To be eligible for this study (Meyers et al., 1983), a patient had to be seronegative for antibody to CMV prior to transplant and to not excrete the virus into the urine for the first two weeks after transplantation. Patients excreting virus during this period were presumed to have been infected with CMV before transplantation and were excluded from final analysis.
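The results of this comparison are summarized with Kaplan–Meier curves and a Mantel–Cox (log-rank) test. A minimal sketch of how such a two-group comparison can be computed, on synthetic data rather than the study data and using the lifelines package as one possible tool, is:

# Illustrative only: synthetic times to CMV infection (weeks), not the study data.
import numpy as np
from lifelines import KaplanMeierFitter
from lifelines.statistics import logrank_test

rng = np.random.default_rng(0)
n = 30
t_globulin = rng.exponential(scale=20, size=n)   # hypothetical infection times
t_control = rng.exponential(scale=12, size=n)
followup = 14.0                                   # administrative censoring at 14 weeks
e_globulin = t_globulin <= followup               # True = infection observed, False = censored
e_control = t_control <= followup
t_globulin = np.minimum(t_globulin, followup)
t_control = np.minimum(t_control, followup)

# Kaplan-Meier estimates of the probability of remaining infection-free
kmf = KaplanMeierFitter()
kmf.fit(t_globulin, event_observed=e_globulin, label="globulin")
ax = kmf.plot_survival_function()
kmf.fit(t_control, event_observed=e_control, label="control")
kmf.plot_survival_function(ax=ax)

# Mantel-Cox (log-rank) test comparing the two randomized groups
result = logrank_test(t_globulin, t_control,
                      event_observed_A=e_globulin, event_observed_B=e_control)
print(result.p_value)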


FIGURE 31.4
Kaplan–Meier probabilities of acquiring cytomegalovirus infection. The numbers in parentheses indicate the sample size of patients still at risk of infection at the beginning of each interval. The risk is different for globulin recipients and controls at p = .03 by the Mantel–Cox test.

Figure 31.4 compares Kaplan–Meier estimates of the probability of CMV infection as a function of week after transplant for globulin recipients and control patients. The overall difference in infection rates between globulin recipients and controls was not significant. CMV infection rates by strata are shown in Table 31.1. Seeking any lead to act upon, the difference observed among patients receiving no granulocytes provided hope that globulin might be effective in a larger study of some subset of subjects. The difference in rates depending upon whether the granulocyte donor was seronegative or seropositive finally led us to question seriously the role of granulocyte transfusions in CMV infection.

TABLE 31.1
Incidence of CMV infection.

Patient Treatment                        Globulin        No Globulin
Granulocytes from seropositive donors    7/8 (87.5%)     6/7 (85.7%)
Granulocytes from seronegative donors    1/5 (20.0%)     0/6 (0.0%)
No granulocytes                          2/17 (11.8%)    8/19 (42.1%)

We were thunderstruck by the possibility that we were transmitting CMV through the blood. The impact, should this observation be confirmed in a controlled randomized study, is described by Meyers et al. (1983):

“Screening blood products for antibody to cytomegalovirus, or more appropriately for virus or viral antigens (techniques that are not yet available), increases the burden on blood-banking facilities, decreases the pool of blood donors, and, most importantly, decreases the rapid availability of fresh blood products such as platelets. The use of an immune globulin is therefore an attractive practical alternative among patients who need large amounts of fresh blood products such as platelets and for whom screening of blood products is less practical.”

The New England Journal of Medicine rejected our paper (Meyers et al., 1983) because our observations concerning granulocyte transfusions were external to the hypotheses postulated in the initial experimental design.


But these observations led to new hypotheses and the next randomized study. While it is extremely important to distinguish between observations obtained by controlled randomized intervention and those obtained otherwise, hypothesis generation is an essential task.

We spent a year working with King County Blood Bank to develop screening procedures, set up laboratory equipment and train technicians in order to conduct a randomized clinical trial. Although we restricted the study to patients who were seronegative for CMV in two consecutive tests and who had not received any unscreened blood recently, more patients were available for study than the blood bank could handle. Therefore, we studied the prophylactic capability of immune globulin at the same time in a randomized 2 × 2 factorial design. CMV immune globulin had no effect [data not shown] on the rate of CMV infection (Bowden et al., 1986).

The effect of only giving CMV seronegative blood transfusions, controlling for the marrow donor's CMV status, is summarized in Table 31.2. Among patients whose marrow donors were seronegative, those randomized to receive seronegative granulocyte transfusions had a 4.5% infection rate, whereas those randomized to receive unscreened transfusions had a 32% infection rate. What is more, the one patient with a seronegative donor who was assigned to receive seronegative blood products and subsequently became infected with CMV actually mistakenly received several seropositive transfusions.

TABLE 31.2
Incidence of CMV infection among 85 patients studied for at least 62 days after transplantation.

                             Randomized to Granulocytes
Marrow Donor's CMV Status    Seronegative     Unscreened
Seronegative                 1/22 (4.5%)      8/25 (32.0%)
Seropositive                 3/12 (25.0%)     5/16 (31.3%)
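As a purely arithmetical illustration of the contrast in the first row of Table 31.2 (1/22 versus 8/25 infected), one could compare the two proportions with an exact test; this sketch is not the analysis reported by Bowden et al. (1986), and the variable names are arbitrary:

# Compare 1/22 vs 8/25 infection rates (seronegative marrow donors) with
# Fisher's exact test; an illustrative re-computation of one row of Table 31.2.
from scipy.stats import fisher_exact

infected = [1, 8]
uninfected = [22 - 1, 25 - 8]            # 21 and 17 patients free of infection
table = [[infected[0], uninfected[0]],
         [infected[1], uninfected[1]]]
odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(f"odds ratio = {odds_ratio:.2f}, two-sided p = {p_value:.3f}")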


As the study proceeded, blood bank personnel became increasingly agitated as they considered the ramifications of a significant finding. Blood banks all over the country would have to set up screening programs; the cost would be enormous, they warned. The study results went out to blood banks, however, and viral screening procedures were put into place. The timing was fortuitous because the AIDS crisis was building. Today the idea that viral infections can be transmitted through the blood is taken for granted.

31.4 Conclusions

Our experience suggests three key steps to successful experimental designs. First, ask an important, well defined question. Too often this step receives insufficient attention and resources, and it may be the most difficult. Determining a well structured and important question can involve considerable data collection, exploratory analysis and data mining. Data mining without data collection targeted to a specific question may yield valuable findings, but many important questions now go begging in the push to devote time and resources to analyzing existing databases. (When medical fellows began rotating through the Fred Hutchinson Cancer Center wards, they had to conceive of a study, execute it and draft a paper. The questions they raised were inventive and some resulted in life-saving findings. When I left in 1986, having created a shared interdisciplinary research information system, the fellows typically looked at what data was available and studied a question that could be answered with the available data. I found the relative lack of imagination in the questions being asked extremely distressing, and feel responsible for enabling it.)

The problem associated with moving too quickly to confirmatory studies has fairly recently been acknowledged by the pharmaceutical industry. But this acknowledgement seems slow to translate into substantially increased resources being devoted to early phase studies.

The second key step is to develop interventions that focus sharply on the question at hand and to randomize subjects to the interventions and to the standard of care or a placebo. This step is operationally well developed. The third step is to replicate the experiment and encourage others to do likewise.

While this series of studies resulted in important discoveries, other series of studies were not so fruitful. In particular, two-arm randomized studies of chemotherapy and radiation schedules were largely uninformative. These variables are virtually continuous, whereas the important variables in the studies of CMV were mostly inherently discrete. With such multidimensional continuous random variables in an environment in which ethics preclude utilizing a large sample space covering unknown territory, our unbridled enthusiasm for the two-arm clinical trial as a learning tool was misplaced. Coming to appreciate the necessity of exploratory analysis in such settings led to my current work in adaptive designs and their analysis.


References

Appelbaum, F., Meyers, J., Fefer, A., Flournoy, N., Cheever, N., Greenberg, M., Greenberg, P., Hackman, R., and Thomas, E. (1982). Nonbacterial nonfungal pneumonia following marrow transplantation in 100 identical twins. Transplantation, 33:265–268.

Bowden, R., Sayers, M., Flournoy, N., Newton, B., Banaji, M., Thomas, E., and Meyers, J. (1986). Cytomegalovirus immune globulin and seronegative blood products to prevent primary cytomegalovirus infection after marrow transplant. New England Journal of Medicine, 314:1006–1010.

Flournoy, N. and Hearne, L. (1990). Quality Control for a Shared Multidisciplinary Database. Marcel Dekker, New York, pp. 43–56.

Meyers, J.D., Flournoy, N., and Thomas, E.D. (1980). Cytomegalovirus infection and specific cell-mediated immunity after marrow transplant. The Journal of Infectious Diseases, 142:816–824.

Meyers, J.D., Flournoy, N., and Thomas, E.D. (1982). Nonbacterial pneumonia after allogeneic marrow transplantation. Reviews of Infectious Diseases, 4:1119–1132.

Meyers, J.D., Flournoy, N., and Thomas, E.D. (1986). Risk factors for cytomegalovirus infection after human marrow transplant. The Journal of Infectious Diseases, 153:478–488.

Meyers, J.D., Leszczynski, J., Zaia, J.A., Flournoy, N., Newton, B., Snydman, D.R., Wright, G.G., Levin, M.L., and Thomas, E.D. (1983). Prevention of cytomegalovirus infection by cytomegalovirus immune globulin after marrow transplantation. Annals of Internal Medicine, 98:442–446.

Neiman, P.E., Reeves, W., Ray, G., Flournoy, N., Lerner, K.G., Sale, G.E., and Thomas, E.D. (1977). A prospective analysis of interstitial pneumonia and opportunistic viral infection among recipients of allogeneic bone marrow grafts. The Journal of Infectious Diseases, 136:754–767.

Thomas, E.D., Buckner, C.D., Banaji, M., Clift, R.A., Fefer, A., Flournoy, N., Goodell, B.W., Hickman, R.O., Lerner, K.G., Neiman, P.E., Sale, G.E., Sanders, J.E., Singer, J., Stevens, M., Storb, R., and Weiden, P.L. (1977). One hundred patients with acute leukemia treated by chemotherapy, total body irradiation, and allogeneic marrow transplantation. Blood, 49:511–533.


32
Statistics and public health research

Ross L. Prentice
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center, Seattle, WA

Statistical thinking and methods have much to contribute in the area of multidisciplinary public health research. Often barriers to research progress can only be addressed using innovative statistical methods. Furthermore, statisticians have an important role to play in helping to shape the population science research agenda. These points will be illustrated using topics in nutritional epidemiology, preventive intervention development, and randomized trial design and analysis.

32.1 Introduction

It is a pleasure to join the celebration of the 50th anniversary of COPSS, which continues to fulfill a stimulatory and valuable coordinating role among the participating statistical societies. This anniversary provides a reminder of the impressive advances in statistical theory and application over the past 50 years, as is certainly the case in the biomedical research context, and in the public health research area more specifically.

Much of biomedical research involves the follow-up of cohorts of individuals to observe health-related outcomes. Most frequently this work involves human studies, but the statistical methods employed may also apply to studies in model systems. A typical study may involve relating some treatment, or some set of study subject characteristics or exposures, to the time until a disease event occurs. Early statistical proposals for the analysis of such "failure time" data in therapeutic trial settings involved the use of linear models, usually for the logarithm of failure time. Because of the usual presence of right censorship, error distributions having closed-form right tails were often employed, rather than traditional Normal models. At the same time, methodology for epidemiological applications was developing a focus on relative disease rates, or closely related odds ratios, with the Mantel and Haenszel (1959) summary odds ratio estimator standing out as a key statistical and epidemiological contribution. Also, nonparametric methods and hazard rate estimators entered through the Kaplan and Meier (1958) survivor function estimator.


These modeling approaches came together in the Cox (1972) regression model, one of the most influential and highly cited statistical papers of all time. The semiparametric Cox model extended the ratio modeling into a full hazard ratio regression approach, while also incorporating a nonparametric baseline hazard rate that valuably relaxed parametric models, such as the Weibull model, that had previously been used. Furthermore, the parametric hazard ratio component of this semiparametric model could be relaxed in important ways by including, for example, stratification on key confounding factors, treatment by time interactions to relax proportionality assumptions, and stochastic time-dependent covariates to examine associations for covariates collected during study follow-up. For relatively rare outcomes, the Cox model proportional hazards special case is well approximated by a corresponding odds ratio regression model, and logistic regression soon became the mainstay approach to the analysis of case-control epidemiological data (Prentice and Pyke, 1979).

Over the past 30 years, valuable statistical methods have been developed for data structures that are more complex than a simple cohort follow-up with a univariate failure time outcome. Many such developments were motivated by substantive challenges in biomedicine. These include nested case-control and case-cohort sampling procedures to enhance estimation efficiency with rare disease outcomes; methods for the joint analysis of longitudinal and failure time data; sequential data analysis methods; missing and mismeasured data methods; multivariate failure time methods, including recurrent event and correlated/clustered failure time methods; and event history models and methods more generally. Many of these developments, along with corresponding statistical theory, have been summarized in book form where pertinent references may be found; see, e.g., Andersen et al. (1993) and Kalbfleisch and Prentice (2002).

In the last decade, foci for the development of statistical methods in biomedical applications have included the incorporation of high-dimensional genomic data, with regularization approaches to deal with dimensionality and data sparsity as in, e.g., Tibshirani (1996); methods for the development, evaluation and utilization of biomarkers for many purposes, including early disease detection, disease recurrence detection, and objective exposure assessment; and methods for disease risk prediction that integrate with concepts from the diagnostic testing literature. Relative disease rate modeling, and the Cox model in particular, provided a foundation for many of these developments.
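In symbols, the hazard ratio regression form just described specifies the hazard at time t for a subject with (possibly time-dependent) covariate vector Z as

\[
  \lambda(t \mid Z) = \lambda_0(t)\,\exp\{\beta^\top Z(t)\},
\]

where \lambda_0 is an unspecified baseline hazard function, \exp(\beta_j) is the hazard ratio associated with a unit increase in the jth covariate, and \beta is estimated from the partial likelihood without modeling \lambda_0.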


32.2 Public health research

While many of the developments just listed were motivated by clinical research applications, the public health and disease prevention areas also provided considerable stimulation. In fact, the public health research arena is one that merits consideration as a career emphasis for many more statistically trained investigators: If we look back over this same 50-year period, we will see that 50 years ago papers were just starting to emerge on cigarette smoking and lung cancer, with tobacco companies advertising the many virtues of smoking, including the ability to "soothe the throat." Currently, with the smoking patterns in place since women adopted patterns of smoking behavior similar to those for men over the past 20–30 years, the estimated lung cancer mortality rate among current smokers, whether male or female, is a whopping 25 times that in never smokers, with substantial elevations also in mortality rates for chronic obstructive pulmonary disease, multiple other cancers, and coronary heart disease as well; see, e.g., Thun et al. (2013). Vigorous and organized research programs are needed for exposures having these types of horrendous health consequences to be identified early, and for the responsible exposures to be eliminated or reduced.

An example of the potential of public health research when supported by regulatory action is provided by the use of postmenopausal hormones. Postmenopausal estrogens came on the market about 50 years ago, and the use of estrogens throughout the lifespan was promoted as the way for women to retain youth and vitality, while avoiding the vasomotor symptoms associated with menopause. By the mid-1970s, it was apparent that the widely used estrogens that derived from the urine of pregnant mares led to a 5–10 fold increase in the risk of uterine cancer, so a progestin was added to protect the uterus. Observational epidemiological data mostly collected over the subsequent 20 years seemed supportive of the utility of those preparations (conjugated equine estrogens for women who were post-hysterectomy; estrogens plus medroxyprogesterone acetate for women with a uterus), with reported benefits for heart disease, fracture and dementia prevention, among other health benefits.

However, a different picture emerged when these regimens were put to the test in randomized controlled trials (Writing Group for the Women's Health Initiative Investigators, 2002; Anderson et al., 2004). This was especially the case for combined estrogens plus progestin, where health benefits were exceeded by health risks, including an early elevation in heart disease, sustained elevations in stroke, a major elevation in breast cancer risk, and an increase in probable dementia. These trial results were instrumental in leading to suitable package insert warnings by the US Food and Drug Administration, and to a major change in the use of these regimens, with about 70% of women taking estrogens plus progestin stopping abruptly in 2002, along with about 40% of women taking estrogens alone.


One can project that this and subsequent changes in hormone therapy practices have led, for example, to about 15,000 to 20,000 fewer women developing breast cancer each year since 2003 in the United States alone, along with tens of thousands of additional such women elsewhere in the world.

Moving to another public health topic, obesity is the epidemic of our time. It is clear that overweight and obesity arise from a sustained imbalance over time in energy consumed in the diet compared to energy expended at rest and through physical activity. Obesity is an established risk factor for many of the chronic diseases that are experienced in great excess in Western societies, including vascular diseases and several major cancers, and diabetes. However, specific knowledge from nutritional or physical activity epidemiology as to which dietary and activity patterns can be recommended is substantially lacking and, at any rate, is not sufficiently compelling to stimulate the societal changes that may be needed to begin to slow and reverse the obesity epidemic. For example, needed changes may involve personal choices in food selection, preparation and consumption patterns; choices away from a sedentary lifestyle; food supply and distribution changes; changes in city design; restrictions in advertising; and taxation changes, to cite but a few. Of course, favorable dietary and physical activity patterns may have health benefits that go well beyond normal weight maintenance.

The remainder of this short contribution will elaborate some of the research barriers to public health research progress in some areas just mentioned, with focus on statistical issues.

32.3 Biomarkers and nutritional epidemiology

While other application areas also grapple with exposure assessment issues, these problems appear to dominate in the nutritional epidemiology research area. For example, an international review (World Health Organization, 2003) of nutritional epidemiology research identified few dietary exposures that are associated with vascular diseases or cancer, with most reports based on self-reported diet, typically using a food frequency questionnaire (FFQ) approach where the study subject reports consumption frequency and serving size over the preceding few months for a list of foods.

A lack of consistency among epidemiological reports on specific dietary associations has stimulated a modest focus on dietary assessment measurement error over the past 25–30 years. Much of this work has involved comparisons of FFQ data to corresponding data using other self-report tools, such as 24-hour dietary recalls (24-HRs), or several days of food records, to examine FFQ measurement error properties. However, for a few important dietary factors, including total energy consumption and protein consumption, one can obtain objective dietary assessments, at least for relatively short periods of time, from urinary excretion markers.


When those biomarker measures are compared to self-report data, one sees strong positive measurement error correlations among FFQs, 24-HRs and four-day food records (4-DFRs); see Prentice et al. (2011).

The implication is that biomarker data, but not data using another self-report, need to be used to assess self-report measurement error, and to calibrate the self-report data for use in nutritional epidemiology association studies. Studies to date using this type of regression calibration approach tend to give quite different results from traditional analyses based on self-report data alone, for example, with strong positive associations between total energy consumption and heart disease and breast, colorectal and total cancer incidence; see, e.g., Prentice and Huang (2011).

From a statistical modeling perspective, calibrated dietary exposure estimates typically arise from linear regression of (log-transformed) biomarker values on corresponding self-report estimates and on such study subject characteristics as body mass index, age, and ethnicity. These latter variables are quite influential in explaining biomarker variation, which may in part reflect systematic biases in dietary reporting. For example, while persons of normal weight tend to show little energy under-reporting, obese persons underestimate substantially, in the 30–50% range on average (Heitmann and Lissner, 1995). These types of systematic biases can play havoc with disease association analyses if not properly addressed.

Measurement error correction methods are not easy for nutritional epidemiologists to grasp, and are not so easy even for nutritionally-oriented statisticians. A logical extension of the biomarker calibration work conducted to date is a major research emphasis on nutritional biomarker development, to produce measurement error–corrected consumption estimates for many more nutrients and foods. Statisticians, in conjunction with nutritional and epidemiological colleagues, can play a major role in establishing the rationale for, and the design of, such a nutritional biomarker development enterprise, which may entail the conduct of sizeable human feeding studies. For example, such a feeding study among 150 free-living Seattle participants in the Women's Health Initiative is currently nearing completion, and will examine candidate biomarkers and higher dimensional metabolic profiles for novel nutritional biomarker development.
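In schematic form, and with notation introduced here purely for illustration (W a biomarker-based intake measure, Q the corresponding self-reported intake, V study subject characteristics such as body mass index, age, and ethnicity), the calibration step described above fits

\[
  \log W = \alpha_0 + \alpha_1 \log Q + \alpha_2^\top V + \epsilon,
\]

and the fitted value then replaces the self-reported intake in the subsequent disease association model, with variance estimates that acknowledge that the calibration coefficients \alpha were themselves estimated.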


32.4 Preventive intervention development and testing

Closely related to the development of biomarkers of exposure is the use of biomarkers for preventive intervention development. While there is a rather large enterprise for the development of therapeutic interventions, the development of innovative disease prevention interventions is less impressive. One reason for this discrepancy is that prevention may not share the same fiscal incentives as treatment, since persons whose disease has been delayed or averted are usually not individually identifiable. Furthermore, the types of specimens needed for relevant biological measures (e.g., gene expression profiles) are frequently not available in the context of studies of large cohorts of healthy persons. As a result, preventive interventions that have been studied in randomized controlled trials have mostly involved pill taking approaches, with rationale derived from observational epidemiology or borrowed from preceding therapeutic trials.

Specifically, there have been few trials of behavioral interventions with chronic disease outcomes. As an exception, the Diabetes Prevention Program Research Group (2002) randomized trial among 3234 persons having impaired glucose tolerance demonstrated a major benefit for Type 2 diabetes incidence with a combined dietary and physical activity intervention. Also, the Women's Health Initiative low-fat dietary modification trial (Prentice et al., 2006) among 48,835 ostensibly healthy postmenopausal women demonstrated a modest reduction in its breast cancer primary outcome, but the reduction did not meet the usual requirements for statistical significance (log-rank significance level of .07). There has never been a full-scale physical activity intervention trial with chronic disease outcomes.

Statistical methods for high-dimensional data analysis and biological network development may be able to help fill the preventive intervention development gap. For example, changes in proteomic or metabolomic profiles may be able to combine with changes in conventional risk factors for targeted diseases in intermediate outcome intervention trials of practical size and expense to select among, and provide the initial evaluation of, potential preventive interventions in a manner that considers both efficacy and safety.

Also, because of cost and logistics, few full-scale disease prevention trials can be conducted, regardless of the nature of the intervention. Innovative hybrid designs that combine the rather comprehensive profiling of the previous paragraph with case-control data for targeted outcomes that also include the same types of high-dimensional biologic data may be able to produce tests of intervention effects on chronic disease of acceptable reliability for most purposes, at costs that are not extreme. Interventions meeting criteria in such hybrid designs, that also have large public health potential, could then be put forward with a strong rationale for the few full-scale randomized trials with disease outcomes that can be afforded.


32.5 Clinical trial data analysis methods

As noted in the Introduction, statistical methods are rather well developed for the comparison of failure times between randomized groups in clinical trials. However, methods for understanding the key biological pathways leading to an observed treatment effect are less well developed. Efforts to explain treatment differences in terms of post-randomization biomarker changes may be limited by biomarker sample timing issues and temporal aspects of treatment effects. Furthermore, such efforts may be thwarted by measurement error issues in biomarker assessment. Biomarker change from baseline may be highly correlated with treatment assignment, implying likely sensitivity of mediation analysis to even moderate error in intermediate variable assessment.

Another area in need of statistical methodology development is that of multivariate failure time data analysis. While Kaplan–Meier curves, censored data rank tests, and Cox regression provide well-developed tools for the analysis of univariate failure time data, corresponding established tools have not stabilized for characterizing dependencies among failure times, and for examining treatment effects jointly with a set of failure time outcomes. For example, in the context of the postmenopausal hormone therapy trials mentioned earlier (Anderson et al., 2004; Writing Group for the Women's Health Initiative Investigators, 2002), one could ask whether data on stroke occurrence can be used to strengthen the estimation of treatment effects on coronary heart disease, and vice versa, in a nonparametric manner. The lack of standardized approaches to addressing this type of question can be traced to the lack of a suitable nonparametric maximum likelihood estimator of the multivariate survivor function, which could point the way to nonparametric and semiparametric likelihood approaches to the analysis of more complex multivariate failure time data structures.

32.6 Summary and conclusion

Statistical thinking and innovation have come to play a major role throughout biomedical research during the 50 years of COPSS' existence. Public health aspects of these developments have lagged somewhat due to the need to rely substantially on purely observational data for most purposes, for practical reasons. Such observational data are valuable and adequate for many purposes, but they may require innovative biomarker supplementation for exposures that are difficult to assess, as in nutritional and physical activity epidemiology. This could include supplementation by intermediate outcome, or full-scale, randomized prevention trials for topics of great public health importance, such as postmenopausal hormone therapy; and supplementation by mechanistic and biological network data for timely identification of the health effects of exposures, such as cigarette smoking, that are not amenable to human experimentation.


These are among the most important research needs for the health of the populations we serve. These populations are keenly interested in, and highly supportive of, public health and disease prevention research. Statisticians have as crucial a role as any other disciplinary group in responding to this interest and trust, and statistical training is highly valuable for participation and leadership roles in shaping and carrying out the needed research.

Acknowledgements

This work was supported by grants CA53996 and CA119171, and contract HHSN268720110046C from the National Institutes of Health.

References

Andersen, P.K., Borgan, Ø., Gill, R.D., and Keiding, N. (1993). Statistical Models Based on Counting Processes. Springer, New York.

Anderson, G.L., Limacher, M.C., Assaf, A.R., Bassford, T., Beresford, S.A., Black, H.R., Bonds, D.E., Brunner, R.L., Brzyski, R.G., Caan, B. et al. (2004). Effects of conjugated equine estrogen in postmenopausal women with hysterectomy: The Women's Health Initiative randomized controlled trial. Journal of the American Medical Association, 291:1701–1712.

Cox, D.R. (1972). Regression models and life-tables (with discussion). Journal of the Royal Statistical Society, Series B, 34:187–220.

Diabetes Prevention Program Research Group (2002). Reduction in the incidence of Type 2 diabetes with lifestyle intervention or metformin. New England Journal of Medicine, 346:393–403.

Heitmann, B.L. and Lissner, L. (1995). Dietary underreporting by obese individuals — Is it specific or non-specific? British Medical Journal, 311:986–989.

Kalbfleisch, J.D. and Prentice, R.L. (2002). The Statistical Analysis of Failure Time Data, 2nd edition. Wiley, New York.


Kaplan, E.L. and Meier, P. (1958). Nonparametric estimation from incomplete observations. Journal of the American Statistical Association, 53:457–481.

Mantel, N. and Haenszel, W. (1959). Statistical aspects of the analysis of data from retrospective studies of disease. Journal of the National Cancer Institute, 22:719–748.

Prentice, R.L., Caan, B., Chlebowski, R.T., Patterson, R., Kuller, L.H., Ockene, J.K., Margolis, K.L., Limacher, M.C., Manson, J.E., Parker, L.M. et al. (2006). Low-fat dietary pattern and risk of invasive breast cancer. Journal of the American Medical Association, 295:629–642.

Prentice, R.L. and Huang, Y. (2011). Measurement error modeling and nutritional epidemiology association analyses. The Canadian Journal of Statistics, 39:498–509.

Prentice, R.L., Mossavar-Rahmani, Y., Huang, Y., Van Horn, L., Beresford, S.A., Caan, B., Tinker, L., Schoeller, D., Bingham, S., Eaton, C.B. et al. (2011). Evaluation and comparison of food records, recalls, and frequencies for energy and protein assessment by using recovery biomarkers. American Journal of Epidemiology, 174:591–603.

Prentice, R.L. and Pyke, R. (1979). Logistic disease incidence models and case-control studies. Biometrika, 66:403–411.

Thun, M.J., Carter, B.D., Feskanich, D., Freedman, N.D., Prentice, R.L., Lopez, A.D., Hartge, P., and Gapstur, S.M. (2013). 50-year trends in smoking-related mortality in the United States. New England Journal of Medicine, 368:351–364.

Tibshirani, R.J. (1996). Regression shrinkage and selection via the LASSO. Journal of the Royal Statistical Society, Series B, 58:267–288.

World Health Organization (2003). Diet, Nutrition and the Prevention of Chronic Diseases. Report of a Joint WHO/FAO Expert Consultation, vol. 916. World Health Organization, Genève, Switzerland.

Writing Group for the Women's Health Initiative Investigators (2002). Risks and benefits of estrogen plus progestin in healthy postmenopausal women: Principal results from the Women's Health Initiative randomized controlled trial. Journal of the American Medical Association, 288:321–333.


33
Statistics in a new era for finance and health care

Tze Leung Lai
Department of Statistics
Stanford University, Stanford, CA

We are entering a new era in finance in the wake of the recent financial crisis and financial reform, and in health care as the provisions of the Affordable Care Act are being implemented from 2010 to 2020. We discuss opportunities and challenges for the field of statistics in this new era.

33.1 Introduction

The past few years witnessed the beginning of a new era in financial markets and in the US health care system. In March 2010, landmark health care reform was passed through two federal statutes: the Patient Protection and Affordable Care Act, which was subsequently amended by the Health Care and Education Reconciliation Act. A few months later, the Dodd–Frank Wall Street Reform and Consumer Protection Act was signed into federal law on July 21, 2010, in response to widespread calls for regulatory reform following the 2007–08 financial crisis. Since this year marks the 50th anniversary of the COPSS, it seems timely to discuss how statistical science can help address the challenges of this new era for finance and health care and to suggest some outreach opportunities for the field of statistics.

We begin with health care in Section 33.2. One of the provisions of the Patient Protection and Affordable Care Act is the establishment of a non-profit Patient-Centered Outcomes Research Institute to undertake comparative effectiveness research (CER), examining the "relative health outcomes, clinical effectiveness, and appropriateness" of different medical treatments. This involves the design of comparative studies of the treatments and their statistical analysis, and Section 33.2 discusses the limitations of standard study designs when applied to CER and describes some innovative designs for comparative effectiveness trials.


Section 33.3 proceeds further to discuss innovative designs to improve the efficiency of clinical trials in the development of new treatments.

In Section 33.4, after reviewing the flaws in credit risk modeling and management that led to the 2007–08 financial crisis, we discuss how these flaws can be addressed by using better statistical methods. Section 33.5 continues this discussion on the role of statistics in financial and risk modeling in the new era after the Dodd–Frank Act. Some concluding remarks are given in Section 33.6.

33.2 Comparative effectiveness research clinical studies

One approach to CER is to use observational studies, including analysis of claims or registry data; see, e.g., Stukel et al. (2007). As pointed out by Shih and Lavori (2013), such an approach involves "confounding by indication," the tendency for clinicians and patients to choose treatments with their anticipated effects in mind. This leads to bias in estimating the effectiveness, which has to be handled by statistical adjustments and modeling techniques, or instrumental variables methods, or some combination. An obvious way to remove confounding is a randomized trial. However, conventional randomized trial designs are not only too costly but also ineffective in changing medical practice. An example is the Antihypertensive and Lipid Lowering Treatment to Prevent Heart Attack Trial (ALLHAT), which was a randomized, double-blind, multi-center clinical trial designed to recruit 40,000 hypertensive patients to be randomized to a diuretic treatment (chlorthalidone) and three alternative antihypertensive pharmacologic treatments. Patients were followed every three months for the first year and every four months thereafter for an average of six years of follow-up. This landmark CER trial cost over $100 million. The results showed no difference in the prevention of heart attack and the superiority of chlorthalidone in preventing one or more forms of cardiovascular disease (ALLHAT Collaborative Research Group, 2002). Yet, a few years later, the impact of the trial was found to be disappointing because of difficulty in persuading doctors to change, scientific disagreement about the interpretation of the results, and heavy marketing by the pharmaceutical companies of their own drugs; see Lai and Lavori (2011).

Section 4 of Lai and Lavori (2011) describes some innovative approaches that are promising to meet the challenges of designing CER clinical trials. One is sequential multiple-assignment randomization for dynamic treatment strategies in the management of patients with chronic diseases, as in Thall et al. (2007), who describe a two-stage randomized trial of twelve different strategies of first-line and second-line treatments of androgen-independent prostate cancer. Another is the equipoise-stratified randomization used by the STAR*D trial (Rush et al., 2004), which compares seven treatment options in patients who did not attain a satisfactory response with citalopram, a selective serotonin reuptake inhibitor antidepressant.


After receiving citalopram, participants without sufficient symptomatic benefit were eligible for randomization among these options. A novel feature of the study design is that it ascertains before randomization the set of options that the patient-clinician dyad considers to be equally reasonable, given the patient's preferences and his or her state after a trial of citalopram. This set of options characterizes the patient's Equipoise Stratum (ES). A total of 1429 patients were randomized under this scheme. The largest ES were the "Medication Switch Only" group, allowing randomization among the three medications (40%), and the "Medication Augmentation Only" group, allowing randomization between two options (29%). The "Any Augmentation" (10%) and "Any Switch" (7%) groups were the next largest, and only 5% of patients were randomized among options that contrasted a switch and an augment condition. In retrospect, it became clear that patients (and their clinicians) were roughly divided into two groups, those who obtained partial benefit from citalopram and therefore were interested in augmentation, and those who obtained no benefit and were interested only in switching. Thus, the ES design allowed the study to self-design in assigning patients to the parts of the experiment that were relevant to current practice and to patient preferences.

A third approach is to design point-of-care (POC) clinical trials, which can be regarded as experiments embedded into clinical care; see Fiore et al. (2011) on a VA-sponsored trial that compares the effectiveness of two insulin dosing regimens for hospitalized diabetic patients. In POC trials, subjects are randomized at the health care encounter, clinician equipoise defines the reference population, and baseline and/or outcome data are captured through electronic medical records. By using outcome-adaptive randomization, POC trials integrate experimentation into implementation and learn sequentially the superior treatment(s). This is similar to the classical multi-arm bandit problem (Lai, 2001) except that POC uses adaptive randomization to implement it in a clinical setting with clinician equipoise and the patient's informed consent. Group sequential generalized likelihood ratio tests with efficient outcome-adaptive randomization to multiple arms have recently been developed and can be used for POC trials (Lai and Liao, 2012; Shih and Lavori, 2013).
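To convey the flavor of outcome-adaptive randomization in a toy setting (this is not the group sequential generalized likelihood ratio procedure of Lai and Liao (2012), and the success probabilities below are made up), one can simulate Thompson sampling for two arms with binary outcomes:

# Toy outcome-adaptive randomization via Thompson sampling (Beta-Bernoulli),
# illustrating how assignment probabilities shift toward the better arm.
import numpy as np

rng = np.random.default_rng(1)
true_p = [0.35, 0.55]           # hypothetical success rates for arms 0 and 1
successes = np.ones(2)          # Beta(1, 1) priors on each arm's success rate
failures = np.ones(2)

assignments = []
for patient in range(200):
    # One posterior draw per arm; picking the larger draw assigns each patient
    # with probability equal to the posterior probability that the arm is best.
    draws = rng.beta(successes, failures)
    arm = int(np.argmax(draws))
    outcome = rng.random() < true_p[arm]   # simulated binary response
    successes[arm] += outcome
    failures[arm] += 1 - outcome
    assignments.append(arm)

print("share of patients assigned to arm 1:", np.mean(assignments))
print("posterior mean success rates:", successes / (successes + failures))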


33.3 Innovative clinical trial designs in translational medicine

Besides CER studies, Lai and Lavori (2011) describe novel design methods for clinical studies of personalized treatments and targeted cancer therapies in translational medicine. "From bench to bedside" — in which "bench" refers to laboratory experiments to study new biochemical principles and novel molecular compounds and "bedside" refers to new treatments developed after preclinical animal studies and Phase I, II, and III trials involving human subjects — is a maxim of evidence-based translational medical research.

The development of imatinib, the first drug to target the genetic defects of a particular cancer while leaving healthy cells unharmed, exemplifies this maxim and has revolutionized the treatment of cancer. A Phase I clinical trial treating CML (chronic myeloid leukemia) patients with the drug began in June 1998, and within six months remissions had occurred in all patients, as determined by their white blood cell counts returning to normal. In a subsequent five-year study on survival, which followed 553 CML patients who had received imatinib as their primary therapy, only 5% of the patients died from CML and 11% died from all causes during the five-year period. Moreover, there were few significant side effects; see Druker et al. (2006). Such remarkable success of targeted therapies has led to hundreds of kinase inhibitors and other targeted drugs that are in various stages of development in the present anticancer drug pipeline.

Most new targeted treatments, however, have resulted in only modest clinical benefit, with less than 50% remission rates and less than one year of progression-free survival, unlike a few cases such as trastuzumab in HER2-positive breast cancer, imatinib in CML and GIST, and gefitinib and erlotinib in non-small cell lung cancer. While the targeted treatments are devised to attack specific targets, the "one size fits all" treatment regimens commonly used may have diminished their effectiveness, and genomic-guided and risk-adapted personalized therapies that are tailored for individual patients are expected to substantially improve the effectiveness of these treatments. To achieve this potential for personalized therapies, the first step is to identify and measure the relevant biomarkers. The markers can be individual genes or proteins or gene expression signatures. The next step is to select drugs (standard cytotoxins, monoclonal antibodies, kinase inhibitors and other targeted drugs) based on the genetics of the disease in individual patients and biomarkers of drug sensitivity and resistance. The third step is to design clinical trials to provide data for the development and verification of personalized therapies. This is an active area of research and several important developments are reviewed in Lai and Lavori (2011) and Lai et al. (2012). It is an ongoing project of ours at Stanford's Center for Innovative Study Design.

Despite the sequential nature of Phase I–III trials, in which Phase I studies are used to determine a safe dose or dosage regimen, Phase II trials are used to evaluate the efficacy of the drug for particular indications (endpoints) in patients with the disease, and Phase III trials aim to demonstrate the effectiveness of the drug for its approval by the regulatory agency, the trials are often planned separately, treating each trial as an independent study whose design depends on studies in previous phases. Since the sample sizes of the trials are often inadequate because of separate planning, an alternative strategy is to expand a trial seamlessly from one phase into the next phase, as in the Phase II–III cancer trial designs of Inoue et al. (2002) and Lai et al. (2012).


The monograph by Bartroff et al. (2012) gives an overview of recent developments in sequential and adaptive designs of Phase I, II, and III clinical trials, and statistical analysis and inference following these trials.

33.4 Credit portfolios and dynamic empirical Bayes in finance

The 2007–08 financial crisis began with unexpectedly high default rates of subprime mortgage loans in 2007 and culminated in the collapse of large financial institutions such as Bear Stearns and Lehman Brothers in 2008. Parallel to the increasing volume of subprime mortgage loans, whose value was estimated to be $1.3 trillion by March 2007, an important development in financial markets from 2000 to 2007 was the rapid growth of credit derivatives, culminating in $32 trillion worth of notional principal for outstanding credit derivatives by December 2009. These derivative contracts are used to hedge against credit loss of either a single corporate bond, as in a credit default swap (CDS), or a portfolio of corporate bonds, as in a cash CDO (collateralized debt obligation), or a variant thereof called a synthetic CDO. In July 2007, Bear Stearns disclosed that two of its subprime hedge funds, which were invested in CDOs, had lost nearly all their value following a rapid decline in the subprime mortgage market, and Standard & Poor's (S&P) downgraded the company's credit rating. In March 2008, the Federal Reserve Bank of New York initially agreed to provide a $25 billion collateralized 28-day loan to Bear Stearns, but subsequently changed the deal to make a $30 billion loan to JPMorgan Chase to purchase Bear Stearns. Lehman Brothers also suffered unprecedented losses for its large positions in subprime and other lower-rated mortgage-backed securities in 2008. After attempts to sell it to Korea Development Bank and then to Bank of America and to Barclays failed, it filed for Chapter 11 bankruptcy protection on September 15, 2008, making the largest bankruptcy filing, with over $600 billion in assets, in US history. A day after Lehman's collapse, American International Group (AIG) needed a bailout by the Federal Reserve Bank, which gave the insurance company a secured credit facility of up to $85 billion to enable it to meet collateral obligations after its credit ratings were downgraded below AA, in exchange for a stock warrant for 79.9% of its equity. AIG's London unit had sold credit protection in the form of CDS and CDO contracts to insure $44 billion worth of securities originally rated AAA. As Lehman's stock price was plummeting, investors found that AIG had valued its subprime mortgage-backed securities at 1.7 to 2 times the values used by Lehman and lost confidence in AIG. Its share price had fallen over 95% by September 16, 2008. The "contagion" phenomenon, from increased default probabilities of subprime mortgages to those of counterparties in credit derivative contracts whose values vary with credit ratings, was mostly neglected in the models of joint default intensities that were used to price CDOs and mortgage-backed securities. These models also failed to predict well the "frailty" traits of latent macroeconomic variables that underlie mortgages and mortgage-backed securities.


These models also failed to predict well the "frailty" traits of latent macroeconomic variables that underlie mortgages and mortgage-backed securities.

For a multiname credit derivative such as a CDO involving M firms, it is important to model not only the individual default intensity processes but also the joint distribution of these processes. Finding tractable models that can capture the key features of the interrelationships of the firms' default intensities has been an active area of research since intensity-based (also called reduced-form) models became a standard approach to pricing the default risk of a corporate bond; see Duffie and Singleton (2003) and Lando (2004). Let Φ denote the standard Normal distribution function, and let G_i be the distribution function of the default time τ_i for the ith firm, where i ∈ {1,...,M}. Then Z_i = Φ^{-1}{G_i(τ_i)} is standard Normal. Li (2000) went on to assume that (Z_1,...,Z_M) is multivariate Normal and specified its correlation matrix Γ by using the correlations of the stock returns of the M firms. This is an example of a copula model; see, e.g., Genest and Favre (2007) or Genest and Nešlehová (2012).
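To make this construction concrete, the following is a minimal simulation sketch of the Gaussian copula just described. It assumes, purely for illustration, exponential marginal default-time distributions with hazard rates `hazards` and a given correlation matrix `Gamma`; these inputs are hypothetical and not taken from the chapter.

```python
import numpy as np
from scipy.stats import norm

def simulate_default_times(Gamma, hazards, n_sims=10000, seed=0):
    """Draw correlated default times via a Gaussian copula:
    Z ~ N(0, Gamma), U_i = Phi(Z_i), tau_i = G_i^{-1}(U_i)."""
    rng = np.random.default_rng(seed)
    M = len(hazards)
    Z = rng.multivariate_normal(np.zeros(M), Gamma, size=n_sims)
    U = norm.cdf(Z)                      # transform to Uniform(0,1)
    # Invert the assumed exponential marginals: G_i(t) = 1 - exp(-lambda_i * t)
    tau = -np.log(1.0 - U) / np.asarray(hazards, dtype=float)
    return tau                           # shape (n_sims, M): joint default times

# Illustration: three firms with pairwise correlation 0.3
Gamma = np.array([[1.0, 0.3, 0.3],
                  [0.3, 1.0, 0.3],
                  [0.3, 0.3, 1.0]])
tau = simulate_default_times(Gamma, hazards=[0.02, 0.05, 0.03])
# Monte Carlo estimate of the probability that at least two firms default within 5 years
print(np.mean((tau < 5.0).sum(axis=1) >= 2))
```

The same two-step recipe (transform each marginal to a standard Normal, impose the correlation matrix Γ, transform back) is what made the model so easy to use for pricing, and also what hides the restrictive dependence structure criticized below.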
Because it provides a simple way to model default correlations, the Gaussian copula model quickly became a widely used tool to price CDOs and other multi-name credit derivatives that were previously too complex to price, despite the lack of a convincing argument connecting the stock return correlations to the correlations of the Normally distributed transformed default times. In a commentary on "the biggest financial meltdown since the Great Depression," Salmon (2012) mentioned that the Gaussian copula approach, which "looked like an unambiguously positive breakthrough," was used uncritically by "everybody from bond investors and Wall Street banks to rating agencies and regulators" and "became so deeply entrenched — and was making people so much money — that warnings about its limitations were largely ignored." In the wake of the financial crisis, it was recognized that better, albeit less tractable, models of correlated default intensities are needed for pricing CDOs and for risk management of credit portfolios. It was also recognized that such models should include relevant firm-level and macroeconomic variables for default prediction and also incorporate frailty and contagion.

The monograph by Lai and Xing (2013) reviews recent work on dynamic frailty and contagion models in the finance literature and describes a new approach involving dynamic empirical Bayes and generalized linear mixed models (GLMMs), which has been shown to compare favorably with the considerably more complicated hidden Markov models for the latent frailty processes or the additive intensity models for contagion. The empirical Bayes (EB) methodology, introduced by Robbins (1956) and Stein (1956), considers n independent and structurally similar problems of inference on the parameters θ_i from observed data Y_1,...,Y_n, where Y_i has probability density f(y|θ_i). The θ_i are assumed to have a common prior distribution G that has unspecified hyperparameters. Letting d_G(y) be the Bayes decision rule (with respect to some loss function and assuming known hyperparameters) when Y_i = y is observed, the basic principle underlying EB is that a parametric form of G (as in Stein, 1956) or even G itself (as in Robbins, 1956) can be consistently estimated from Y_1,...,Y_n, leading to the EB rule d_Ĝ. Dynamic EB extends this idea to longitudinal data Y_it; see Lai et al. (2013). In the context of insurance claims over time for n contracts belonging to the same risk class, the conventional approach to insurance rate-making (called "evolutionary credibility" in actuarial science) assumes a linear state-space model for the longitudinal claims data so that the Kalman filter can be used to estimate the claims' expected values, which are assumed to form an autoregressive time series. Applying the EB principle to the longitudinal claims from the n insurance contracts, Lai and Sun (2012) have developed a class of linear mixed models as an alternative to linear state-space models for evolutionary credibility and have shown that their predictive performance is comparable to that of the Kalman filter when the claims are generated by a linear state-space model. This approach can be readily extended to GLMMs, not only for longitudinal claims data but also for the default probabilities of n firms, incorporating frailty, contagion, and regime switching. Details are given in Lai and Xing (2013).
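For readers unfamiliar with the EB idea, here is a minimal parametric sketch in the spirit of Stein (1956): Y_i ~ N(θ_i, σ_i²) with θ_i ~ N(μ, τ²), where the hyperparameters (μ, τ²) are estimated from Y_1,...,Y_n and plugged into the Bayes rule. This is only an illustration of the general principle, not the dynamic EB/GLMM machinery of Lai and Xing (2013), and the data below are made up.

```python
import numpy as np

def parametric_eb_estimates(y, sigma2):
    """Parametric empirical Bayes for Y_i ~ N(theta_i, sigma2_i), theta_i ~ N(mu, tau2).
    Hyperparameters are estimated by simple moments from Y_1,...,Y_n,
    then plugged into the Bayes (posterior-mean) rule."""
    y = np.asarray(y, dtype=float)
    sigma2 = np.asarray(sigma2, dtype=float)
    mu_hat = y.mean()
    # Crude moment estimate of the prior variance tau2, truncated at zero;
    # more efficient weighted versions exist.
    tau2_hat = max(y.var(ddof=1) - sigma2.mean(), 0.0)
    shrink = tau2_hat / (tau2_hat + sigma2)       # shrinkage factors
    theta_hat = mu_hat + shrink * (y - mu_hat)    # EB rule d_Ghat applied to each Y_i
    return theta_hat, mu_hat, tau2_hat

# Eight structurally similar problems with known sampling variances
y = [2.1, -0.3, 1.4, 0.8, 3.0, -1.1, 0.2, 1.9]
sigma2 = [1.0] * 8
theta_hat, mu_hat, tau2_hat = parametric_eb_estimates(y, sigma2)
print(np.round(theta_hat, 2), round(mu_hat, 2), round(float(tau2_hat), 2))
```

The key point, which carries over to the dynamic and GLMM extensions, is that the n structurally similar problems are used jointly to estimate the prior, and each individual estimate is then shrunk toward the common mean.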
33.5 Statistics in the new era of finance

Statistics has been assuming an increasingly important role in quantitative finance and risk management after the financial crisis, which exposed the weakness and limitations of traditional financial models, pricing and hedging theories, risk measures, and the management of derivative securities and structured products. Better models and paradigms, and improvements in risk management systems, are called for. Statistics can help meet these challenges, which in turn may lead to new methodological advances for the field.

The Dodd–Frank Act and recent financial reforms in the European Union and other countries have led to new financial regulations that enforce transparency and accountability and enhance consumer financial protection. The need for good and timely data for risk management and regulatory supervision is well recognized, but how to analyze these massive datasets and use them to give early warning and develop adaptive risk control strategies is a challenging statistical problem that requires domain knowledge and interdisciplinary collaboration. The monograph by Lai and Xing (2013) describes some recent research in sequential surveillance and early warning, particularly for systemic risk, which is the risk of a broad-based breakdown in the financial system as experienced in the recent financial crisis. It reviews the critical financial market infrastructure and core-periphery network models for mathematical representation of the infrastructure; such networks incorporate the transmission of risk and liquidity to and from the core and periphery nodes of the network. After an overview of the extensive literature in statistics and engineering on sequential change-point detection and estimation, statistical process control, and stochastic adaptive control, it discusses how these methods can be modified and further developed for network models to come up with early warning indicators for financial instability and systemic failures. This is an ongoing research project with colleagues at the Financial and Risk Modeling Institute at Stanford, which is an interdisciplinary research center involving different schools and departments.

Besides risk management, the field of statistics has played an important role in algorithmic trading and quantitative investment strategies, which have gained popularity after the financial crisis when hedge funds employing these strategies outperformed many equity indices and other risky investment options. Statistical modeling of market microstructure and limit-order book dynamics is an active area of research; see, for example, Ait-Sahalia et al. (2005) and Barndorff-Nielsen et al. (2008). Even the foundational theory of mean-variance portfolio optimization has received a new boost from contemporaneous developments in statistics during the past decade. A review of these developments is given by Lai et al. (2011), who also propose a new statistical approach that combines the solution of a basic stochastic optimization problem with flexible modeling to incorporate time series features in the analysis of the training sample of historical data.

33.6 Conclusion

There are some common threads linking statistical modeling in finance and health care for the new era. One is related to "Big Data" for regulatory supervision, risk management and algorithmic trading, and for emerging health care systems that involve electronic medical records, genomic and proteomic biomarkers, and computer-assisted support for patient care. Another is related to the need for collaborative research that can integrate statistics with domain knowledge and subject-matter issues. A third thread is related to dynamic panel data and empirical Bayes modeling in finance and insurance. Health insurance reform is a major feature of the 2010 Affordable Care Act and has led to a surge of interest in the new direction of health insurance in actuarial science. In the US, insurance contracts are predominantly fee-for-service (FFS). In such arrangements the FFS contracts offer perverse incentives for providers, typically resulting in over-utilization of medical care. Issues with FFS would be mitigated with access to more reliable estimates of patient heterogeneity based on administrative claims information. If the health insurer can better predict future claims, it can better compensate providers for caring for patients with complicated conditions. This is an area where the field of statistics can have a major impact.
References

Ait-Sahalia, Y., Mykland, P., and Zhang, L. (2005). How often to sample a continuous-time process in the presence of market microstructure noise. Review of Financial Studies, 18:351–416.
ALLHAT Collaborative Research Group (2002). Major outcomes in high-risk hypertensive patients randomized to angiotensin-converting enzyme inhibitor or calcium channel blocker vs diuretic: The Antihypertensive and Lipid-Lowering Treatment to Prevent Heart Attack Trial (ALLHAT). Journal of the American Medical Association, 288:2981–2997.
Barndorff-Nielsen, O.E., Hansen, P.R., Lunde, A., and Shephard, N. (2008). Designing realized kernels to measure the ex post variation of equity prices in the presence of noise. Econometrica, 76:1481–1536.
Bartroff, J., Lai, T.L., and Shih, M.-C. (2012). Sequential Experimentation in Clinical Trials: Design and Analysis. Springer, New York.
Druker, B.J., Guilhot, F., O'Brien, S.G., Gathmann, I., Kantarjian, H., Gattermann, N., Deininger, M.W.N., Silver, R.T., Goldman, J.M., Stone, R.M., Cervantes, F., Hochhaus, A., Powell, B.L., Gabrilove, J.L., Rousselot, P., Reiffers, J., Cornelissen, J.J., Hughes, T., Agis, H., Fischer, T., Verhoef, G., Shepherd, J., Saglio, G., Gratwohl, A., Nielsen, J.L., Radich, J.P., Simonsson, B., Taylor, K., Baccarani, M., So, C., Letvak, L., Larson, R.A., and IRIS Investigators (2006). Five-year follow-up of patients receiving imatinib for chronic myeloid leukemia. New England Journal of Medicine, 355:2408–2417.
Duffie, D. and Singleton, K.J. (2003). Credit Risk: Pricing, Measurement, and Management. Princeton University Press, Princeton, NJ.
Fiore, L., Brophy, M., D'Avolio, L., Conrad, C., O'Neil, G., Sabin, T., Kaufman, J., Hermos, J., Swartz, S., Liang, M., Gaziano, M., Lawler, E., Ferguson, R., Lew, R., Doras, G., and Lavori, P. (2011). A point-of-care clinical trial comparing insulin administered using a sliding scale versus a weight-based regimen. Clinical Trials, 8:183–195.
Genest, C. and Favre, A.-C. (2007). Everything you always wanted to know about copula modeling but were afraid to ask. Journal of Hydrologic Engineering, 12:347–368.
Genest, C. and Nešlehová, J. (2012). Copulas and copula models. Encyclopedia of Environmetrics, 2nd edition. Wiley, Chichester, 2:541–553.
Inoue, L.Y., Thall, P.F., and Berry, D.A. (2002). Seamlessly expanding a randomized phase II trial to phase III. Biometrics, 58:823–831.
Lai, T.L. (2001). Sequential analysis: Some classical problems and new challenges (with discussion). Statistica Sinica, 11:303–408.
Lai, T.L. and Lavori, P.W. (2011). Innovative clinical trial designs: Toward a 21st-century health care system. Statistics in Biosciences, 3:145–168.
Lai, T.L., Lavori, P.W., and Shih, M.-C. (2012). Sequential design of Phase II–III cancer trials. Statistics in Medicine, 31:1944–1960.
Lai, T.L., Lavori, P.W., Shih, M.-C., and Sikic, B.I. (2012). Clinical trial designs for testing biomarker-based personalized therapies. Clinical Trials, 9:141–154.
Lai, T.L. and Liao, O.Y.-W. (2012). Efficient adaptive randomization and stopping rules in multi-arm clinical trials for testing a new treatment. Sequential Analysis, 31:441–457.
Lai, T.L., Su, Y., and Sun, K. (2013). Dynamic empirical Bayes models and their applications to longitudinal data analysis and prediction. Statistica Sinica. To appear.
Lai, T.L. and Sun, K.H. (2012). Evolutionary credibility theory: A generalized linear mixed modeling approach. North American Actuarial Journal, 16:273–284.
Lai, T.L. and Xing, H. (2013). Active Risk Management: Financial Models and Statistical Methods. Chapman & Hall, London.
Lai, T.L., Xing, H., and Chen, Z. (2011). Mean-variance portfolio optimization when means and covariances are unknown. The Annals of Applied Statistics, 5:798–823.
Lando, D. (2004). Credit Risk Modeling: Theory and Applications. Princeton University Press, Princeton, NJ.
Li, D.X. (2000). On default correlation: A copula function approach. Journal of Fixed Income, 9:43–54.
Robbins, H. (1956). An empirical Bayes approach to statistics. In Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, 1954–1955, vol. 1. University of California Press, Berkeley, CA, pp. 157–163.
Rush, A., Fava, M., Wisniewski, S., Lavori, P., Trivedi, M., Sackeim, H., Thase, M., Nierenberg, A., Quitkin, F., Kashner, T., Kupfer, D., Rosenbaum, J., Alpert, J., Stewart, J., McGrath, P., Biggs, M., Shores-Wilson, K., Lebowitz, B., Ritz, L., Niederehe, G., and STAR*D Investigators Group (2004). Sequenced treatment alternatives to relieve depression (STAR*D): Rationale and design. Controlled Clinical Trials, 25:119–142.
Salmon, F. (2012). The formula that killed Wall Street. Significance, 9(1):16–20.
Shih, M.-C. and Lavori, P.W. (2013). Sequential methods for comparative effectiveness experiments: Point of care clinical trials. Statistica Sinica, 23:1775–1791.
Stein, C. (1956). Inadmissibility of the usual estimator for the mean of a multivariate normal distribution. In Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, 1954–1955, vol. 1. University of California Press, Berkeley, CA, pp. 197–206.
Stukel, T.A., Fisher, E.S., Wennberg, D.E., Alter, D.A., Gottlieb, D.J., and Vermeulen, M.J. (2007). Analysis of observational studies in the presence of treatment selection bias: Effects of invasive cardiac management on AMI survival using propensity score and instrumental variable methods. Journal of the American Medical Association, 297:278–285.
Thall, P.F., Logothetis, C., Pagliaro, L.C., Wen, S., Brown, M.A., Williams, D., and Millikan, R.E. (2007). Adaptive therapy for androgen-independent prostate cancer: A randomized selection trial of four regimens. Journal of the National Cancer Institute, 99:1613–1622.
34
Meta-analyses: Heterogeneity can be a good thing

Nan M. Laird
Department of Biostatistics, Harvard School of Public Health, Boston, MA

Meta-analysis seeks to summarize the results of a number of different studies on a common topic. It is widely used to address important and disparate problems in public health and medicine. Heterogeneity in the results of different studies is common. Sometimes perceived heterogeneity is a motivation for the use of meta-analysis, in order to understand and reconcile differences. In other cases the presence of heterogeneity is regarded as a reason not to summarize results. An important role for meta-analysis is the determination of design and analysis factors that influence the outcome of studies. Here I review some of the controversies surrounding the use of meta-analysis in public health and my own experience in the field.

34.1 Introduction

Meta-analysis has become a household word in many scientific disciplines. The uses of meta-analysis vary considerably. It can be used to increase power, especially for secondary endpoints or when dealing with small effects, to reconcile differences in multiple studies, to make inferences about a very particular treatment or intervention, or to address more general issues, such as what is the magnitude of the placebo effect or what design factors influence the outcome of research. In some cases, a meta-analysis indicates substantial heterogeneity in the outcomes of different studies.

With my colleague Rebecca DerSimonian, I wrote several articles on meta-analysis in the early 1980s, presenting a method for dealing with heterogeneity. In this paper, I provide the motivation for this work, describe advantages and difficulties with the method, and discuss current trends in handling heterogeneity.
34.2 Early years of random effects for meta-analysis

I first learned about meta-analysis in the mid-1970s while I was still a graduate student. Meta-analysis was being used then, and even earlier, in the social sciences to summarize the effectiveness of treatments in psychotherapy, the effects of class size on educational achievement, experimenter expectancy effects in behavioral research, and results of other compelling social science research questions.

In the early 1980s, Fred Mosteller introduced me to Warner Slack and Doug Porter at Harvard Medical School, who had done a meta-analysis on the effectiveness of coaching students for the Scholastic Aptitude Tests (SAT). Fred was very active in promoting the use of meta-analysis in the social sciences, and later the health and medical sciences. Using data that they collected on sixteen studies evaluating coaching for verbal aptitude, and thirteen on math aptitude, Slack and Porter concluded that coaching is effective in raising aptitude scores, contradicting the principle that the SATs measure "innate" ability (Slack and Porter, 1980).

What was interesting about Slack and Porter's data was a striking relationship between the magnitude of the coaching effect and the degree of control for the coached group. Many studies evaluated only coached students and compared their before and after coaching scores with national norms provided by the ETS on the average gains achieved by repeat test takers. These studies tended to show large gains for coaching. Other studies used convenience samples as comparison groups, and some studies employed either matching or randomization. This last group of studies showed much smaller gains. Fred's private comment on their analysis was "Of course coaching is effective; otherwise, we would all be out of business. The issue is what kind of evidence is there about the effect of coaching?"

The science (or art) of meta-analysis was then in its infancy. Eugene Glass coined the phrase meta-analysis in 1976 to mean the statistical analysis of the findings of a collection of individual studies (Glass, 1976). Early papers in the field stressed the need to systematically report relevant details on study characteristics, not only about the design of the studies, but also characteristics of subjects, investigators, interventions, measures, study follow-up, etc. However, a formal statistical framework for creating summaries that incorporated heterogeneity was lacking. I was working with Rebecca DerSimonian on random effects models at the time, and it seemed like a natural approach that could be used to examine heterogeneity in a meta-analysis. We published a follow-up article to Slack and Porter in the Harvard Educational Review that introduced a random effects approach for meta-analysis (DerSimonian and Laird, 1983). The approach followed that of Cochran, who wrote about combining the effects of different experiments with measured outcomes (Cochran, 1954). Cochran introduced the idea that the observed effect of each study could be partitioned
into the sum of a "true" effect plus a sampling (or within-study) error. An estimate of the within-study error should be available from each individual study, and can be used to get at the distribution of "true" effects. Many of the social science meta-analyses neglected to identify within-study error, and in some cases, within-study error variances were not reported. Following Cochran, we proposed a method for estimating the mean of the "true" effects, as well as the variation in "true" effects across studies. We assumed that the observed study effect was the difference in two means on a measured scale, and also assumed normality for the distribution of effects and error terms. However, normality is not necessary for validity of the estimates.
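The following sketch illustrates the moment-based computation that this random effects approach leads to (the estimator now usually associated with DerSimonian and Laird), using hypothetical study effects y_i and within-study variances v_i; it is an illustration of the general idea, not code from the chapter. For comparison it also reports the fixed (equal) effects summary, whose standard error ignores between-study variation.

```python
import numpy as np

def random_effects_meta(y, v):
    """Moment-based random effects meta-analysis.
    y: observed study effects; v: within-study variances."""
    y, v = np.asarray(y, float), np.asarray(v, float)
    w = 1.0 / v                                  # fixed-effect weights
    mu_fixed = np.sum(w * y) / np.sum(w)
    se_fixed = np.sqrt(1.0 / np.sum(w))
    # Cochran's Q and the moment estimate of the between-study variance tau2
    Q = np.sum(w * (y - mu_fixed) ** 2)
    k = len(y)
    tau2 = max((Q - (k - 1)) / (np.sum(w) - np.sum(w ** 2) / np.sum(w)), 0.0)
    # Random effects weights and summary
    w_star = 1.0 / (v + tau2)
    mu_re = np.sum(w_star * y) / np.sum(w_star)
    se_re = np.sqrt(1.0 / np.sum(w_star))
    return {"Q": Q, "tau2": tau2,
            "fixed": (mu_fixed, se_fixed), "random": (mu_re, se_re)}

# Hypothetical study effects (e.g., differences in cure rates) and variances
res = random_effects_meta(y=[0.10, 0.02, 0.25, -0.05, 0.15],
                          v=[0.004, 0.010, 0.006, 0.012, 0.008])
print(res)
```

When the estimated tau2 is positive, the random effects standard error is larger than the fixed effects one, which is exactly Cochran's point, recalled later in this section, that ignoring between-study variation gives standard errors that are too small.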
The random effects method for meta-analysis is now widely used, but introduces stumbling blocks for some researchers who find the concept of a distribution of effects for treatments or interventions unpalatable. A major conceptual problem is imagining the studies in the analysis as a sample from a recognizable population. As discussed in Laird and Mosteller (1990), the absence of a sampling frame from which to draw a random sample is a ubiquitous problem in scientific research in most fields, and so should not be considered a special problem unique to meta-analysis. For example, most investigators treat patients enrolled in a study as a random sample from some population of patients, or clinics in a study as a random sample from a population of clinics, and they want to make inferences about the population and not the particular set of patients or clinics. This criticism does not detract from the utility of the random effects method. If the results of different research programs all yielded similar results, there would not be great interest in a meta-analysis. The principle behind a random effects approach is that a major purpose of meta-analysis is to quantify the variation in the results, as well as provide an overall mean summary.

Using our methods to re-analyze Slack and Porter's results, we concluded that any effects of coaching were too small to be of practical importance (DerSimonian and Laird, 1983). Although the paper attracted considerable media attention (articles about the paper were published in hundreds of US newspapers), the number of citations in the scientific literature is comparatively modest.

34.3 Random effects and clinical trials

In contrast to our paper on coaching, a later paper by DerSimonian and myself has been very highly cited, and led to the moniker "DerSimonian and Laird method" when referring to a random-effects meta-analysis (DerSimonian and Laird, 1986). This paper adapted the random effects model for meta-analysis of clinical trials; the basic idea of the approach is the same, but here the treatment effect was assumed to be the difference in Binomial cure rates between a treated and control group. Taking the observed outcome to be a difference in Binomial cure rates raised various additional complexities. The difference in cure rates is more relevant and interpretable in clinical trials, but statistical methods for combining a series of 2 × 2 tables usually focus on the odds ratio. A second issue is that the Binomial mean and variance are functionally related. As a result, the estimate of the within-study variance (which was used to determine the weight assigned to each study) is correlated with the estimated study effect size. We ignored this problem, with the result that the method can be biased, especially with smaller samples, and better approaches are available (Emerson et al., 1996; Wang et al., 2010). The final issue is estimating and testing for heterogeneity among the results, and how the choice of effect measure (rate difference, risk ratio, or odds ratio) can affect the results. Studies have shown that the choice of effect measure has relatively little effect on the assessment of heterogeneity (Berlin et al., 1989).

The clinical trials setting can be very different from data synthesis in the social sciences. The endpoint of interest may not correspond to the primary endpoint of the trial, and the number of studies can be much smaller. For example, many clinical trials may be designed to detect short term surrogate endpoints, but are under-powered to detect long term benefits or important side effects (Hine et al., 1989). In this setting a meta-analysis can be the best solution for inferences about long term or untoward side effects. Thus the primary purpose of the meta-analysis may be to look at secondary endpoints when the individual studies do not have sufficient power to detect the effects of secondary endpoints. Secondly, it is common to restrict meta-analyses to randomized controlled trials (RCTs) and possibly also to trials using the same treatment. This is in direct contrast to meta-analyses that seek to answer very broad questions; such analyses can include many types of primary studies, studies with different outcome measures, different treatments, etc. In one of the earliest meta-analyses, Beecher (1955) sought to measure the "placebo" effect by combining data from 15 clinical studies of pain for a variety of different indications treated by different placebo techniques.

Yusuf et al. (1985) introduced a "fixed effects" method for combining the results of a series of controlled clinical trials. They proposed a fixed-effect approach to the analysis of all trials ever done, published or not, that presumes we are only interested in the particular set of studies we have found in our search (which is in principle all studies ever done). In practice, statisticians are rarely interested in only the particular participants in our data collection efforts, but want findings that can be generalized to similar participants, whether they be clinics, hospitals, patients, investigators, etc. In fact, the term "fixed" effects is sometimes confused with the equal effects setting, where the statistical methods used implicitly assume that the "true" effects for each study are the same. Yusuf et al. (1985) may have partly contributed to this confusion by stating that their proposed estimate and standard error of the overall effect do not require equality of the effects, but then cautioning that the interpretation of the results is restricted to the case where the effects are "approximately similar."
As noted by Cochran, ignoring variation in the effects of different studies generally gives smaller standard errors that do not account for this variance.

34.4 Meta-analysis in genetic epidemiology

In recent years, meta-analysis is increasingly used in a very different type of study: large scale genetic epidemiology studies. Literally thousands of reports are published each year on associations between genetic variants and diseases. These reports may look at only a few genetic variants in specified genes, or they may test hundreds of thousands of variants in all the chromosomes in Genome Wide Association Scans (GWAS). These GWAS reports are less than ten years old, but are now the standard approach for gene "discovery" in complex diseases. Because there are so many variants being tested, and because the effect sizes tend to be quite small, replication of positive findings in independent samples has been considered a requirement for publication right from the beginning. But gradually there has been a shift from reporting the results of the primary study and a small but reasonable number of replications, to pooling the new study and the replication studies, and some or all available GWAS studies using the same endpoint.

In contrast to how the term meta-analysis is used elsewhere, the term meta-analysis is often used in the genetic epidemiology literature to describe what is typically called "pooling" in the statistical literature, that is, analyzing individual level data together as a single study, stratifying or adjusting by source. The pooling approach is popular and usually viewed as more desirable, despite the fact that studies (Lin and Zeng, 2010) have shown that combining the summary statistics in a meta-analysis is basically equivalent to pooling under standard assumptions. I personally prefer meta-analysis because it enables us to better account for heterogeneity and inflated variance estimates.

Like all epidemiological studies, the GWAS is influenced by many design factors which will affect the results, and GWAS do have a few special features which impact the use of pooling or meta-analysis, especially in the context of heterogeneity. First, the cost of genotyping is great, and to make these studies affordable, samples have been largely opportunistic. Especially early on, most GWAS used pre-existing samples where sufficient biological material was available for genotyping and the traits of interest were already available. This can cause considerable heterogeneity. For example, I was involved in the first GWAS to find an association between a genetic variant and obesity as measured by body mass index (BMI); it illustrates the importance of study design.
The original GWAS (with only 100,000 genetic markers) was carried out in the Framingham Heart Study (Herbert et al., 2006). We used a novel approach to the analysis and had data from five other cohorts for replication. All but one of the five cohorts reproduced the result. This was an easy study to replicate, because virtually every epidemiological study of disease has also measured height and weight, so that BMI is available, even though the study was not necessarily designed to look at factors influencing BMI. In addition, replication required genotyping only one genetic marker. In short order, hundreds of reports appeared in the literature, many of them non-replications. A few of us undertook a meta-analysis of these reported replications or non-replications with the explicit hypothesis that the study population influences the results; over 76,000 subjects were included in this analysis. We considered three broad types of study populations. The first is the general population cohort, where subjects were drawn from general population samples without restricting participants on the basis of health characteristics. Examples of this are the Framingham Heart Study and a German population based sample (KORA). The second is healthy population samples, where subjects are drawn from populations known to be healthier than those in the general population, typically from some specific work force. The third category of studies included those specifically designed to study obesity; those studies used subjects chosen on the basis of obesity, including case-control samples where obese and non-obese subjects were selected to participate and family-controlled studies that used only obese subjects and their relatives.

In agreement with our hypothesis, the strongest result we found was that the effect varied by study population. The general population samples and the selected samples replicated the original study in finding a significant association, but the healthy population studies showed no evidence of an effect. This is a critically important finding. Many fields have shown that randomized versus non-randomized, blinded versus unblinded, etc., can have major effects, but this finding is a bit different. Using healthy subjects is not intrinsically a poor design choice, but may be so for many common, complex disorders. One obvious reason is that genetic studies of healthy subjects may lack sufficient variation in outcome to have much power. More subtle factors might include environmental or other genetic characteristics which interact to modify the gene effect being investigated. In any event, it underscores the desirability of assessing and accounting for heterogeneity in meta-analyses of genetic associations.
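One way to formalize a hypothesis of this kind is to partition Cochran's Q into within- and between-subgroup components and test whether the pooled effect differs across subgroups. The sketch below does this for hypothetical per-study effect estimates grouped by study-population type; the numbers and group labels are illustrative only, not the data from the obesity meta-analysis.

```python
import numpy as np
from scipy.stats import chi2

def subgroup_heterogeneity(y, v, groups):
    """Fixed-effect pooling within each subgroup, plus a between-subgroup
    Q statistic testing whether the subgroup summaries share a common mean."""
    y, v, groups = np.asarray(y, float), np.asarray(v, float), np.asarray(groups)
    w = 1.0 / v
    summaries = {}
    for g in np.unique(groups):
        m = groups == g
        summaries[g] = (np.sum(w[m] * y[m]) / np.sum(w[m]),   # subgroup estimate
                        1.0 / np.sum(w[m]))                    # its variance
    mus = np.array([s[0] for s in summaries.values()])
    variances = np.array([s[1] for s in summaries.values()])
    W = 1.0 / variances
    mu_all = np.sum(W * mus) / np.sum(W)
    Q_between = np.sum(W * (mus - mu_all) ** 2)
    p_value = chi2.sf(Q_between, df=len(summaries) - 1)
    return summaries, Q_between, p_value

# Illustrative per-study log odds ratios, variances, and population types
y = [0.20, 0.25, 0.02, 0.01, 0.22, 0.18]
v = [0.010, 0.020, 0.010, 0.015, 0.020, 0.010]
groups = ["general", "general", "healthy", "healthy", "obesity", "obesity"]
print(subgroup_heterogeneity(y, v, groups))
```

A small p-value for Q_between indicates that a single pooled summary would obscure systematic differences across the study-population types, which is precisely the design message of the obesity example.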
A second issue is related to the fact that there are still relatively few GWAS of specific diseases, so many meta-analyses involve only a handful of studies. The random effects approach of DerSimonian and Laird does not work well with only a handful of studies because it estimates a variance, where the sample size for the variance estimate is the number of studies. Finally, for a GWAS, where hundreds of thousands of genetic markers are tested, often on thousands of subjects, any meta-analysis method needs to be easily implemented in order to be practically useful. Software for meta-analysis is included in the major genetic analysis statistical packages (Evangelou and Ioannidis, 2013), but most software packages only implement a fixed effects approach. As a result, the standard meta-analyses of GWAS use the fixed effects approach and potentially overstate the precision in the presence of heterogeneity.

34.5 Conclusions

The DerSimonian and Laird method has weathered a great deal of criticism, and undoubtedly we need better methods for random effects analyses, especially when the endpoints of interest are proportions and when the number of studies being combined is small and/or the sample sizes within each study are small. Most meta-analyses involving clinical trials acknowledge the importance of assessing variation in study effects, and new methods for quantifying this variation are widely used (Higgins and Thompson, 2002). In addition, meta-regression methods for identifying factors influencing heterogeneity are available (Berkey et al., 1995; Thompson and Higgins, 2002); these can be used to form subsets of studies which are more homogeneous. There is an extensive literature emphasizing the necessity and desirability of assessing heterogeneity, and many of these papers reinforce the role of study design in connection with heterogeneity. The use of meta-analysis in genetic epidemiology to find disease genes is still relatively new, but the benefits are widely recognized (Ioannidis et al., 2007). Better methods for implementing random effects methods with a small number of studies will be especially useful here.

References

Beecher, H.K. (1955). The powerful placebo. Journal of the American Medical Association, 159:1602–1606.
Berkey, C.S., Hoaglin, D.C., Mosteller, F., and Colditz, G.A. (1995). A random-effects regression model for meta-analysis. Statistics in Medicine, 14:395–411.
Berlin, J.A., Laird, N.M., Sacks, H.S., and Chalmers, T.C. (1989). A comparison of statistical methods for combining event rates from clinical trials. Statistics in Medicine, 8:141–151.
Cochran, W.G. (1954). The combination of estimates from different experiments. Biometrics, 10:101–129.
DerSimonian, R. and Laird, N.M. (1983). Evaluating the effect of coaching on SAT scores: A meta-analysis. Harvard Educational Review, 53:1–15.
DerSimonian, R. and Laird, N.M. (1986). Meta-analysis in clinical trials. Controlled Clinical Trials, 7:177–188.
Emerson, J.D., Hoaglin, D.C., and Mosteller, F. (1996). Simple robust procedures for combining risk differences in sets of 2 × 2 tables. Statistics in Medicine, 15:1465–1488.
Evangelou, E. and Ioannidis, J.P.A. (2013). Meta-analysis methods for genome-wide association studies and beyond. Nature Reviews Genetics, 14:379–389.
Glass, G.V. (1976). Primary, secondary, and meta-analysis of research. Educational Researcher, 5:3–8.
Herbert, A., Gerry, N.P., McQueen, M.B., Heid, I.M., Pfeufer, A., Illig, T., Wichmann, H.-E., Meitinger, T., Hunter, D., Hu, F.B. et al. (2006). A common genetic variant is associated with adult and childhood obesity. Science, 312:279–283.
Higgins, J. and Thompson, S.G. (2002). Quantifying heterogeneity in a meta-analysis. Statistics in Medicine, 21:1539–1558.
Hine, L., Laird, N.M., Hewitt, P., and Chalmers, T. (1989). Meta-analytic evidence against prophylactic use of lidocaine in acute myocardial infarction. Archives of Internal Medicine, 149:2694–2698.
Ioannidis, J.P.A., Patsopoulos, N.A., and Evangelou, E. (2007). Heterogeneity in meta-analyses of genome-wide association investigations. PLoS ONE, 2:e841.
Laird, N.M. and Mosteller, F. (1990). Some statistical methods for combining experimental results. International Journal of Technological Assessment of Health Care, 6:5–30.
Lin, D. and Zeng, D. (2010). Meta-analysis of genome-wide association studies: No efficiency gain in using individual participant data. Genetic Epidemiology, 34:60–66.
Slack, W. and Porter, D. (1980). The scholastic aptitude test: A critical appraisal. Harvard Educational Review, 66:1–27.
Thompson, S.G. and Higgins, J. (2002). How should meta-regression analyses be undertaken and interpreted? Statistics in Medicine, 21:1559–1573.
Wang, R., Tian, L., Cai, T., and Wei, L. (2010). Nonparametric inference procedure for percentiles of the random effects distribution in meta-analysis. The Annals of Applied Statistics, 4:520–532.
Yusuf, S., Peto, R., Lewis, J., Collins, R., and Sleight, P. (1985). Beta blockade during and after myocardial infarction: An overview of the randomized trials. Progress in Cardiovascular Diseases, 27:335–371.


35
Good health: Statistical challenges in personalizing disease prevention

Alice S. Whittemore
Department of Health Research and Policy, Stanford University, Stanford, CA

Increasingly, patients and clinicians are basing healthcare decisions on statistical models that use a person's covariates to assign him/her a probability of developing a disease in a given future time period. In this chapter, I describe some of the statistical problems that arise when evaluating the accuracy and utility of these models.

35.1 Introduction

Rising health care costs underscore the need for cost-effective disease prevention and control. To achieve cost-efficiency, preventive strategies must focus on individuals whose genetic and lifestyle characteristics put them at highest risk. To identify these individuals, statisticians and public health professionals are developing personalized risk models for many diseases and other adverse outcomes. The task of checking the accuracy and utility of these models requires new statistical methods and new applications for existing methods.

35.2 How do we personalize disease risks?

We do this using a personalized risk model, which is an algorithm that assigns a person a probability of developing an adverse outcome in a given future time period (say, five, ten or twenty years). The algorithm combines his/her values for a set of risk-associated covariates with regional incidence and mortality data and quantitative evidence of the covariates' effects on risk of the outcome.
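As a rough illustration of how such an algorithm can turn those ingredients into a personal probability, the sketch below combines age-specific baseline incidence rates, a relative risk summarizing the person's covariates, and competing mortality rates into a five-year absolute risk. It is a generic construction in the spirit of Gail-type models, with made-up rates; it is not the BCRAT algorithm discussed next.

```python
import numpy as np

def five_year_absolute_risk(baseline_incidence, mortality, relative_risk):
    """Absolute risk of the outcome over five one-year intervals, treating
    death from other causes as a competing risk (constant hazards per year)."""
    h_outcome = relative_risk * np.asarray(baseline_incidence, float)
    h_death = np.asarray(mortality, float)
    surv = 1.0          # probability of being alive and outcome-free at start of year
    risk = 0.0
    for h1, h2 in zip(h_outcome, h_death):
        total = h1 + h2
        # Probability that the outcome occurs first during this year
        risk += surv * (h1 / total) * (1.0 - np.exp(-total))
        surv *= np.exp(-total)
    return risk

# Hypothetical annual rates (per person-year) for the next five years
baseline_incidence = [0.0030, 0.0032, 0.0034, 0.0036, 0.0038]  # outcome incidence
mortality = [0.006, 0.007, 0.007, 0.008, 0.009]                # other-cause death
print(round(five_year_absolute_risk(baseline_incidence, mortality,
                                    relative_risk=1.8), 4))
```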


For example, the Breast Cancer Risk Assessment Tool (BCRAT) assigns a woman a probability of developing breast cancer in the next five years, using her self-reported risk factors, relative risk estimates obtained from randomized trials or observational studies, and national rates for breast cancer incidence and death from all causes (Costantino et al., 1999; Gail et al., 1989).

Personal risk models play important roles in the practice of medicine, as standards for clinical care become increasingly tailored to patients' individual characteristics and preferences. The need to evaluate risk models is increasing as personalized medicine evolves. A PUBMED search using the key words validating risk models produced nearly 4000 hits, indicating substantial interest in this topic in current medical practice. Moreover, there are now more than 370 online direct-to-consumer genetic tests with risk assessments for adverse health outcomes (Figure 35.1). The Food and Drug Administration is considering regulating these assessments, so we need reliable methods to validate them.

FIGURE 35.1: Genetic tests sold directly to consumers for medical conditions.

Determining the probability of a future adverse outcome for a particular person is not unlike determining the chance of a hurricane or earthquake in a particular area during a given time period. Not surprisingly, therefore, many of the statistical problems involved in developing and evaluating risk models have intrigued meteorologists and seismologists for decades, and their findings
form a useful foundation for this work; see, e.g., Brier (1950), Hsu and Murphy (1986), Murphy (1973), and Wilks (1995).

35.3 How do we evaluate a personal risk model?

Risk models for long-term future outcomes are commonly assessed with respect to two attributes. Their calibration reflects how well their assigned risks agree with observed outcome occurrence within subgroups of the population. Their discrimination (also called precision or resolution) reflects how well they distinguish those who ultimately do and do not develop the outcome. Good calibration does not imply good discrimination. For example, if the actual disease risks of a population show little inter-personal variation, discrimination will be poor even for a perfectly calibrated risk model. Conversely, good discrimination does not imply good calibration. Discrimination depends only on the ranks of a model's assigned risks, so any rank-invariant transformation of a model's risks will affect its calibration but not its discrimination.

An important task is to quantify how much a model's calibration and discrimination can be improved by expanding it with additional covariates, such as newly discovered genetic markers. However, the discrimination of a risk model depends on the distribution of risk-associated covariates in the population of interest. As noted in the previous paragraph, no model can discriminate well in a population with a homogeneous covariate distribution. Thus, while large discrimination gains from adding covariates to a model are informative (indicating substantial additional risk variation detected by the expanded model), a small precision gain is less so, as it may merely reflect underlying risk homogeneity in the population.

Several metrics have been proposed to assess and compare models with respect to their calibration and discrimination. Their usefulness depends on how they will be used, as shown by the following examples.

Example 1. Risk models are used to determine eligibility for randomized clinical trials involving treatments with serious potential side effects. For instance, the BCRAT model was used to determine eligibility for a randomized trial to determine if tamoxifen can prevent breast cancer (Fisher et al., 1998, 2005). Because tamoxifen increases the risks of stroke, endometrial cancer and deep-vein thrombosis, eligibility was restricted to women whose breast cancer risks were deemed high enough to warrant exposure to these side effects. Thus eligible women were those whose BCRAT-assigned five-year breast cancer risk exceeded 1.67%. For this type of application, a good risk model should yield a decision rule with few false positives, i.e., one that excludes women who truly are at low breast cancer risk. A model without this attribute could
inflict tamoxifen's side effects on women with little chance of gaining from the experience.

Example 2. Risk models are used to improve the cost-efficiency of preventive interventions. For instance, screening with breast magnetic resonance imaging (MRI) detects more breast cancers but costs more and produces more false positive scans, compared to mammography. Costs and false positives can be reduced by restricting MRI to women whose breast cancer risk exceeds some threshold (Plevritis et al., 2006). For this type of application, a good risk model should give a classification rule that assigns mammography to those truly at low risk (i.e., has a low false positive rate), but also assigns MRI to those truly at high risk (i.e., has a high true positive rate).

Example 3. Risk models are used to facilitate personal health care decisions. Consider, for instance, a postmenopausal woman with osteoporosis who must choose between two drugs, raloxifene and alendronate, to prevent hip fracture. Because she has a family history of breast cancer, raloxifene would seem a good choice, since it also reduces breast cancer risk. However, she also has a family history of stroke, and raloxifene is associated with increased stroke risk. To make a rational decision, she needs a risk model that provides accurate information about her own risks of developing three adverse outcomes (breast cancer, stroke, hip fracture), and the effects of the two drugs on these risks.

The first two examples involve classifying people into "high" and "low" categories; thus they require risk models with low false positive and/or false negative rates. In contrast, the third example involves balancing one person's risks for several different outcomes, and thus it requires risk models whose assigned risks are accurate enough at the individual level to facilitate rational healthcare decisions. It is common practice to summarize a model's calibration and discrimination with a single statistic, such as a chi-squared goodness-of-fit test. However, such summary measures do not reveal subgroups whose risks are accurately or inaccurately pegged by a model. This limitation can be addressed by focusing on subgroup-specific performance measures. Evaluating performance in subgroups also helps assess a model's value for facilitating personal health decisions. For example, a woman who needs to know her breast cancer risk is not interested in how a model performs for others in the population; yet summary performance measures involve the distribution of covariates in the entire population to which she belongs.

35.4 How do we estimate model performance measures?

Longitudinal cohort studies allow comparison of actual outcomes to model-assigned risks. At entry to a cohort, subjects report their current and past
covariate values. A risk model then uses these baseline covariates to assign each subject a risk of developing the outcome of interest during a specified subsequent time period. For example, the Breast Cancer Family Registry (BCFR), a consortium of six institutions in the United States, Canada and Australia, has been monitoring the vital statuses and cancer occurrences of registry participants for more than ten years (John et al., 2004). The New York site of the BCFR has used the baseline covariates of some 1900 female registry participants to assign each of them a ten-year probability of breast cancer development according to one of several risk models (Quante et al., 2012). These assigned risks are then compared to actual outcomes during follow-up. Subjects who die before outcome occurrence are classified as negative for the outcome, so those with life-threatening co-morbidities may have low outcome risk because they are likely to die before outcome development.

Using cohort data to estimate outcome probabilities presents statistical challenges. For example, some subjects may not be followed for the full risk period; instead they are last observed alive and outcome-free after only a fraction of the period. An analysis that excludes these subjects may yield biased estimates. Instead, censored time-to-failure analysis is needed, and the analysis must accommodate the competing risk of death (Kalbfleisch and Lawless, 1998; Kalbfleisch and Prentice, 2002; Putter et al., 2007). Another challenge arises when evaluating risk models that include biomarkers obtained from blood collected at cohort entry. Budgetary constraints may prohibit costly biomarker assessment for the entire cohort, and cost-efficient sampling designs are needed, such as a nested case-control design (Ernster, 1994), a case-cohort design (Prentice, 1986), or a two-stage sampling design (Whittemore and Halpern, 2013).

Model calibration is often assessed by grouping subjects into quantiles of assigned risk, and comparing the estimated outcome probability to the mean assigned risk within each quantile. Results are plotted on a graph called an attribution diagram (AD) (Hsu and Murphy, 1986). For example, the top two panels of Figure 35.2 show ADs for subjects from the NY-BCFR cohort who have been grouped into quartiles of breast cancer risk as assigned by two breast cancer risk models, the BCRAT model and the International Breast Cancer Intervention Study (IBIS) model (Tyrer et al., 2004). The null hypothesis of equality between quantile-specific mean outcome probabilities and mean assigned risks is commonly tested by classifying subjects into risk groups and applying a chi-squared goodness-of-fit statistic. This approach has limitations: 1) the quantile grouping is arbitrary and varies across cohorts sampled from the same population; 2) the averaging of risks over subjects in a quantile can obscure subsets of subjects with poor model fit; 3) when confidence intervals for estimated outcome probabilities exclude the diagonal line it is difficult to trouble-shoot the risk model; 4) assuming a chi-squared asymptotic distribution for the goodness-of-fit statistic ignores the heterogeneity of risks within quantiles.
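A bare-bones version of this grouped calibration check, ignoring censoring and competing risks for simplicity (a real analysis would use the censored time-to-event methods cited above), might look like the following; the assigned risks and outcomes are simulated placeholders.

```python
import numpy as np

def grouped_calibration(assigned_risk, outcome, n_groups=4):
    """Compare mean assigned risk with the observed outcome proportion
    within quantile groups of assigned risk (attribution-diagram style)."""
    r = np.asarray(assigned_risk, float)
    y = np.asarray(outcome, int)
    edges = np.quantile(r, np.linspace(0, 1, n_groups + 1))   # quantile cut points
    group = np.clip(np.searchsorted(edges, r, side="right") - 1, 0, n_groups - 1)
    rows = []
    for g in range(n_groups):
        m = group == g
        rows.append((g + 1, int(m.sum()), r[m].mean(), y[m].mean()))
    return rows  # (quantile, n, mean assigned risk, observed proportion)

# Simulated example: a well-calibrated model applied to 2000 subjects
rng = np.random.default_rng(1)
risk = rng.beta(2, 30, size=2000)          # assigned ten-year risks
outcome = rng.binomial(1, risk)            # outcomes drawn at those risks
for row in grouped_calibration(risk, outcome):
    print("quartile %d: n=%d, mean risk=%.3f, observed=%.3f" % row)
```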


FIGURE 35.2: Grouped and individualized goodness-of-fit of BCRAT- and IBIS-assigned breast cancer risks in NY BCFR data.

Some of these limitations are addressed by an alternative approach that uses nearest-neighbor (NN) methods to estimate an outcome probability for each set of subjects with a given assigned risk (Akritas, 1994; Heagerty et al., 2000). The NN estimate of actual risk for a subject with assigned risk r is based on the set of all individuals with risks r′ such that |G(r) − G(r′)| is small.
The observed flattening of the estimated outcome probability curves in the right tails is an artifact of the NN method. Such flattening reflects the clumping of sparse subjects in the tails into the same neighborhood to estimate a single common outcome probability. New methods are needed to address these issues.

Model discrimination is commonly assessed using the concordance statistic or C-statistic, also called the area under the receiver-operating-characteristic curve (Hanley and McNeil, 1982; Pepe, 2003). This statistic estimates the probability that the risk assigned to a randomly sampled individual who develops the outcome exceeds that of a randomly sampled individual who does not. The C-statistic has several limitations. Like all summary statistics, it fails to indicate subgroups for whom a model discriminates poorly, or subgroups for which one model discriminates better than another. In addition, patients and health professionals have difficulty interpreting it. A more informative measure is the Case Risk Percentile (CRP), defined for each outcome-positive subject (case) as the percentile of his/her assigned risk in the distribution of assigned risks of all outcome-negative subjects. The CRP equals 1 − SPV, where SPV denotes her standardized placement value (Pepe and Cai, 2004; Pepe and Longton, 2005). The CRP can be useful for comparing the discrimination of two risk models.

For example, Figure 35.3 shows the distribution of CRPs for 81 breast cancer cases in the NY-BCFR data, based on the BCRAT and IBIS models. Each point in the figure corresponds to a subject who developed breast cancer within 10 years of baseline. Each of the 49 points above the diagonal represents a case whose IBIS CRP exceeds her BCRAT CRP (i.e., IBIS better discriminates her risk from that of non-cases than does BCRAT), and the 32 points below the line represent cases for whom BCRAT discriminates better than IBIS. (Note that CRPs can be computed for any assigned risk, not just those of cases.) A model's C-statistic is just the mean of its CRPs, averaged over all cases. Importantly, covariates associated with having a CRP above or below the diagonal line can indicate which subgroups are better served by one model than the other. The CRPs are individualized measures of model sensitivity. Research is needed to develop alternatives to the C-statistic that are more useful for evaluating model discrimination. Further discussion of this issue can be found in Pepe et al. (2010) and Pepe and Janes (2008).

FIGURE 35.3: Scatterplot of BCRAT and IBIS case risk percentiles for NY BCFR data.
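Here is a small sketch of these two discrimination summaries, computing each case's CRP against the control risk distribution and then the C-statistic as the mean CRP over cases; the risks and outcomes are again simulated placeholders rather than the NY-BCFR data.

```python
import numpy as np

def case_risk_percentiles(assigned_risk, outcome):
    """CRP for each case = fraction of controls with a lower assigned risk
    (ties counted as half); the C-statistic is the mean CRP over cases."""
    r = np.asarray(assigned_risk, float)
    y = np.asarray(outcome, int)
    case_risks, control_risks = r[y == 1], r[y == 0]
    crps = np.array([
        np.mean(control_risks < rc) + 0.5 * np.mean(control_risks == rc)
        for rc in case_risks
    ])
    return crps, crps.mean()   # per-case CRPs and the C-statistic

# Simulated example: cases tend to have somewhat higher assigned risks
rng = np.random.default_rng(2)
risk = rng.beta(2, 30, size=2000)
outcome = rng.binomial(1, risk)
crps, c_stat = case_risk_percentiles(risk, outcome)
print(len(crps), round(float(c_stat), 3))
```

Because each case gets her own CRP, the per-case values can be plotted against covariates or against a second model's CRPs, which is exactly the comparison displayed in Figure 35.3.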
35.5 Can we improve how we use epidemiological data for risk model assessment?

We need better methods to accommodate the inherent limitations of epidemiological data for assessing risk model performance. For example, the subjects in large longitudinal cohort studies are highly selected, so that findings may not be generalizable. Thus we need methods for accommodating bias in estimated performance measures due to cohort selection. Also, because large cohort studies are costly, we need ways to evaluate model discrimination using case-control data that are not nested within a cohort. Finally, we need methods for developing, applying and evaluating multi-state models for multiple adverse events. The following is a brief description of these problem areas.

35.5.1 Cohort selection bias

The covariate distributions of individuals in the general population are not well represented by those of the highly selected participants in large, long-term cohort studies. For example, we found that a published ovarian cancer risk model developed using postmenopausal women in the Nurses' Health Study (Rosner et al., 2005) was well-calibrated to postmenopausal subjects in the California Teachers Study (CTS) (Bernstein et al., 2002) but poorly calibrated to those in the Women's Health Initiative (WHI) (Luo et al., 2011). We found that although covariate-specific hazard ratios are similar in the two cohorts, their covariate distributions are very different: e.g., parity is much higher in WHI than in CTS. Moreover, the distributions of covariates like education and parity among cohort subjects tend to be more homogeneous than those of the general population. Work is needed to compare the distributions of covariates among subjects in cohort studies with those of the general US population, as
represented, for example, by participants in one of the cross-sectional studies conducted by the National Health Interview Survey (NHANES). Methods are needed to use these distributions to estimate the model performance measures we would see if the model were applied to subjects whose covariates reflect those of the general population.

35.5.2 Evaluating risk models with case-control data

Data from case-control studies nested within a cohort are not useful for evaluating model calibration, which concerns the agreement between a model's assigned risks and the actual probabilities of adverse outcome occurrence within a future risk period. Sampling only the outcome-positive and outcome-negative subjects (ignoring the time at risk contributed by censored subjects) can lead to severe bias in calibration measures, due to overestimation of outcome probabilities (Whittemore and Halpern, 2013). However, under certain assumptions, unbiased (though inefficient) estimates of discrimination measures can be obtained from nested case-control studies. The critical assumption is that the censoring be uninformative; i.e., that subjects censored at a given follow-up time are a random sample of all cohort members alive and outcome-free at that time (Heagerty et al., 2000). This assumption is reasonable for the type of censoring encountered in most cohort studies. There is a need to evaluate the efficiency loss in estimated discrimination measures associated with excluding censored subjects.

However, when interest centers on special populations, such as those at high risk of the outcome, it may not be feasible to find case-control data nested within a cohort to evaluate model discrimination. For example, we may want to use breast cancer cases and cancer-free control women ascertained in a high-risk cancer clinic to determine and compare the discrimination of several models for ten-year breast cancer risk. Care is needed in applying the risk models to non-nested case-control data such as these, and in interpreting the results. To mimic the models' prospective setting, two steps are needed: 1) the models must assign outcome risks conditional on the absence of death during the risk period; and 2) subjects' covariates must be assessed at a date ten years before outcome assessment (diagnosis date for cases, date of interview for controls). In principle, the data can then be used to estimate ten-year breast cancer probabilities ignoring the competing risk of death. In practice, the rules for ascertaining cases and controls need careful consideration to avoid potential selection bias (Wacholder et al., 1992).

35.5.3 Designing and evaluating models for multiple outcomes

Validating risk models that focus on a single adverse outcome (such as developing breast cancer within ten years) involves estimating a woman's ten-year breast cancer probability in the presence of co-morbidities causing her death before breast cancer occurrence.
FIGURE 35.4: Graphs showing transition probabilities for transient and absorbing states for breast cancer (single outcome; left graph), and for breast cancer and stroke (two outcomes; right graph). In both graphs death before outcome occurrence is a competing risk.

This competing risk of death is illustrated in the left graph of Figure 35.4. Based on her follow-up during the risk period she is classified as: a) outcome-positive (develops breast cancer); b) outcome-negative (dies before breast cancer or is alive and breast-cancer-free at the end of the period); or c) censored (last observed alive and free of breast cancer before the end of the period). Competing risk theory is needed to estimate her breast cancer probability in these circumstances. Most risk models assume that mortality rates depend only on age at risk, sex and race/ethnicity. However, covariates for co-morbidities are likely to be available and could be important in risk model performance and validation among older cohort subjects. Thus we need to expand existing risk models to include covariates associated with mortality risk, and to examine the effect of this expansion on risk model performance.

There is also a need to examine the feasibility of expanding existing risk models to include multiple outcomes of interest. For example, an osteoporotic woman might need to weigh the risks and benefits of several fracture-preventive options (e.g., tamoxifen, a bisphosphonate, or no drug). If she has a strong family history of certain chronic diseases (e.g., breast cancer, stroke) she needs a model that provides accurate estimates of her risks of these outcomes under each of the options she is considering. Her marginal outcome risks may be estimable from existing single-outcome risk models, but these models do not accommodate correlated risks for different outcomes. Also, they were calibrated to cohorts with different selection factors and different covariate distributions, so their estimates may not be comparable. The graph on the right of Figure 35.4 indicates that the complexity of multi-state models for stochastic processes increases exponentially with the number of outcomes considered. (Here breast cancer (B) and stroke (S) are transient states, since subjects in these states are at risk for the other outcome, while death (D) and development of both breast cancer and stroke (BS) are absorbing states.)

Work is needed to determine whether the rich body of work on multi-state stochastic processes can be applied to cohort data to provide more realistic risk estimates for multiple, competing and noncompeting outcomes. Consider, for example, the simple problem of combining existing models for breast cancer and stroke to assign to each cohort subject a vector of the form (P(B), P(S), P(D), P(BS)), where, for example, P(B) is her assigned risk of developing breast cancer, and P(BS) is her assigned risk of developing both breast cancer and stroke, during the risk period. The resulting multistate model would allow the possibility of within-person correlation in outcome-specific risks.
Consider, for example, the simple problem of combining existing models for breast cancer and stroke to assign to each cohort subject a vector of the form (P(B), P(S), P(D), P(BS)), where, for example, P(B) is her assigned risk of developing breast cancer, and P(BS) is her assigned risk of developing both breast cancer and stroke, during the risk period. The resulting multistate model would allow the possibility of within-person correlation in outcome-specific risks.

35.6 Concluding remarks

This chapter has outlined some statistical problems that arise when assigning people individualized probabilities of adverse health outcomes, and when evaluating the utility of these assignments. Like other areas in statistics, this work rests on foundations that raise philosophical issues. For example, the discussion of risk model calibration assumes that each individual has an unknown probability of developing the outcome of interest in the given time period. This presumed probability depends on his/her values for a set of risk factors, only some of which are known. But, unlike a parameter that governs a statistical distribution, one person's "true risk" does not lend itself to straightforward definition. Yet much of the previous discussion requires this assumption.

Even when the assumption is accepted, fundamental issues arise. How can we estimate one woman's breast cancer probability without aggregating her survival data with those of others who may have different risks? And how much of a woman's breast cancer risk is due purely to chance? If we knew the combined effects of all measurable breast cancer risk factors, and if we could apply this knowledge to assign risks to disease-free women, how much residual variation in subsequent outcomes might we see?

These issues notwithstanding, it seems clear that the need for cost-efficient, high-quality health care will mandate individualized strategies for prevention and treatment. Difficult cost-benefit tradeoffs will become increasingly common as we discover new drugs and therapies with adverse side effects. Patients and their clinical caregivers need rigorous, evidence-based guidance in making the choices confronting them.

References

Akritas, M.G. (1994). Nearest neighbor estimation of a bivariate distribution under random censoring. The Annals of Statistics, 22:1299–1327.


Bernstein, L., Allen, M., Anton-Culver, H., Deapen, D., Horn-Ross, P.L., Peel, D., Pinder, R., Reynolds, P., Sullivan-Halley, J., West, D., Wright, W., Ziogas, A., and Ross, R.K. (2002). High breast cancer incidence rates among California teachers: Results from the California Teachers Study (United States). Cancer Causes Control, 13:625–635.

Brier, G.W. (1950). Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78:1–3.

Costantino, J.P., Gail, M.H., Pee, D., Anderson, S., Redmond, C.K., Benichou, J., and Wieand, H.S. (1999). Validation studies for models projecting the risk of invasive and total breast cancer incidence. Journal of the National Cancer Institute, 91:1541–1548.

Ernster, V.L. (1994). Nested case-control studies. Preventive Medicine, 23:587–590.

Fisher, B., Costantino, J.P., Wickerham, D.L., Cecchini, R.S., Cronin, W.M., and Robidoux, A. (2005). Tamoxifen for the prevention of breast cancer: Current status of the National Surgical Adjuvant Breast and Bowel Project P-1 study. Journal of the National Cancer Institute, 97:1652–1662.

Fisher, B., Costantino, J.P., Wickerham, D.L., Redmond, C.K., Kavanah, M., Cronin, W.M., Vogel, V., Robidoux, A., Dimitrov, N., Atkins, J., Daly, M., Wieand, S., Tan-Chiu, E., Ford, L., and Wolmark, N. (1998). Tamoxifen for prevention of breast cancer: Report of the National Surgical Adjuvant Breast and Bowel Project P-1 Study. Journal of the National Cancer Institute, 90:1371–1388.

Gail, M.H., Brinton, L.A., Byar, D.P., Corle, D.K., Green, S.B., Schairer, C., and Mulvihill, J.J. (1989). Projecting individualized probabilities of developing breast cancer for white females who are being examined annually. Journal of the National Cancer Institute, 81:1879–1886.

Hanley, J.A. and McNeil, B.J. (1982). The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143:29–36.

Heagerty, P.J., Lumley, T., and Pepe, M.S. (2000). Time-dependent ROC curves for censored survival data and a diagnostic marker. Biometrics, 56:337–344.

Hsu, W.R. and Murphy, A.H. (1986). The attributes diagram: A geometrical framework for assessing the quality of probability forecasts. International Journal of Forecasting, 2:285–293.

John, E.M., Hopper, J.L., Beck, J.C., Knight, J.A., Neuhausen, S.L., Senie, R.T., Ziogas, A., Andrulis, I.L., Anton-Culver, H., Boyd, N., Buys, S.S., Daly, M.B., O'Malley, F.P., Santella, R.M., Southey, M.C., Venne, V.L., Venter, D.J., West, D., Whittemore, A.S., Seminara, D., and the Breast Cancer Family Registry (2004). The Breast Cancer Family Registry: An infrastructure for cooperative multinational, interdisciplinary and translational studies of the genetic epidemiology of breast cancer. Breast Cancer Research, 6:R375–R389.

Kalbfleisch, J.D. and Lawless, J.F. (1998). Likelihood analysis of multi-state models for disease incidence and mortality. Statistics in Medicine, 7:140–160.

Kalbfleisch, J.D. and Prentice, R.L. (2002). The Statistical Analysis of Failure Time Data, 2nd edition. Wiley, New York.

Luo, J., Horn, K., Ockene, J.K., Simon, M.S., Stefanick, M.L., Tong, E., and Margolis, K.L. (2011). Interaction between smoking and obesity and the risk of developing breast cancer among postmenopausal women: The Women's Health Initiative Observational Study. American Journal of Epidemiology, 174:919–928.

Murphy, A.H. (1973). A new vector partition of the probability score. Journal of Applied Meteorology, 12:595–600.

Pepe, M.S. (2003). The Statistical Evaluation of Medical Tests for Classification and Prediction. Oxford University Press, Oxford.

Pepe, M.S. and Cai, T. (2004). The analysis of placement values for evaluating discriminatory measures. Biometrics, 60:528–535.

Pepe, M.S., Gu, J.W., and Morris, D.E. (2010). The potential of genes and other markers to inform about risk. Cancer Epidemiology, Biomarkers & Prevention, 19:655–665.

Pepe, M.S. and Janes, H.E. (2008). Gauging the performance of SNPs, biomarkers, and clinical factors for predicting risk of breast cancer. Journal of the National Cancer Institute, 100:978–979.

Pepe, M.S. and Longton, G. (2005). Standardizing diagnostic markers to evaluate and compare their performance. Epidemiology, 16:598–603.

Plevritis, S.K., Kurian, A.W., Sigal, B.M., Daniel, B.L., Ikeda, D.M., Stockdale, F.E., and Garber, A.M. (2006). Cost-effectiveness of screening BRCA1/2 mutation carriers with breast magnetic resonance imaging. Journal of the American Medical Association, 295:2374–2384.

Prentice, R.L. (1986). A case-cohort design for epidemiologic cohort studies and disease prevention trials. Biometrika, 73:1–11.

Putter, H., Fiocco, M., and Geskus, R.B. (2007). Tutorial in biostatistics: Competing risks and multi-state models. Statistics in Medicine, 26:2389–2430.

Quante, A.S., Whittemore, A.S., Shriver, T., Strauch, K., and Terry, M.B. (2012). Breast cancer risk assessment across the risk continuum: Genetic and nongenetic risk factors contributing to differential model performance. Breast Cancer Research, 14:R144.

Rosner, B.A., Colditz, G.A., Webb, P.M., and Hankinson, S.E. (2005). Mathematical models of ovarian cancer incidence. Epidemiology, 16:508–515.

Tyrer, J., Duffy, S.W., and Cuzick, J. (2004). A breast cancer prediction model incorporating familial and personal risk factors. Statistics in Medicine, 23:1111–1130.

Wacholder, S., McLaughlin, J.K., Silverman, D.T., and Mandel, J.S. (1992). Selection of controls in case-control studies. I. Principles. American Journal of Epidemiology, 135:1019–1049.

Whittemore, A.S. and Halpern, J. (2013). Two-stage sampling designs for validating personal risk models. Statistical Methods in Medical Research, in press.

Wilks, D.S. (1995). Statistical Methods in the Atmospheric Sciences. Academic Press, London.


36
Buried treasures

Michael A. Newton
Departments of Statistics and of Biostatistics and Medical Informatics, University of Wisconsin, Madison, WI

Keeping pace with the highly diversified research frontier of statistics is hard enough, but I suggest that we also pay ever closer attention to great works of the past. I offer no prescription for how to do this, but reflect instead on three cases from my own research where my solution involved realizing a new interpretation of an old, interesting but possibly uncelebrated result which had been developed in a different context.

36.1 Three short stories

36.1.1 Genomics meets sample surveys

Assessing differential expression patterns between cancer subtypes provides some insight into their biology and may direct further experimentation. On similar tissues, cancer may follow distinct developmental pathways and thus produce distinct expression profiles. These differences may be captured by the sample variance statistic, which would be large when some members of a gene set (functional category) have high expression in one subtype compared to the other, and other members go the opposite way. A case in point is a collection of cell-cycle regulatory genes and their expression pattern in tumors related to human papilloma virus (HPV) infection. Pyeon et al. (2007) studied the transcriptional response in n = 62 head, neck and cervical cancer samples, some of which were positive for virus (HPV+) and some of which were not (HPV−). Gene-level analysis showed significant differential expression in both directions. Set-level analysis showed that one functional category stood out from the several thousands of known categories in having an especially large value of between-gene/within-set sample variance. This category was detected using a standardized sample variance statistic. The detection launched a series of experiments on the involved genes, both in the same tissues under alternative measurement technology and on different tissues. The findings led to a new hypothesis about how HPV+/− tumors differentially deregulate the cell-cycle processes during tumorigenesis, as well as to biomarkers for HPV-associated cancers (Pyeon et al., 2011). Figure 36.1 shows a summary of gene-level differential expression scores between HPV+ and HPV− cancers (so-called log fold changes), for all genes in the genome (left), as well as for m = 99 genes from a cell-cycle regulatory pathway.

A key statistical issue in this case was how to standardize a sample variance statistic. The gene-level data were first reduced to the log-scale fold change between HPV+ and HPV− cell types; these x_g, for genes g, were then considered fixed in subsequent calculations. For a known functional category c ⊆ {1,...,G} of size m, the statistic u(x, c) measured the sample variance of the x_g's within c. This statistic was standardized by imagining the distribution of u(x, C), for random sets C considered to be drawn uniformly from among all \binom{G}{m} possible size-m subsets of the genome. Well, forgetting about all the genomics, the statistical question concerned the distribution of the sample variance in without-replacement finite-population sampling; in particular, I needed an expected value and variance of u(x, C) under this sampling. Not being especially well versed in the findings of finite-population sampling, I approached these moment questions from first principles and with a novice's vigor, figuring that something simple was bound to emerge. I did not make much progress on the variance of u(x, C), but was delighted to discover a beautiful solution in Tukey (1950, p. 517), which had been developed far from the context of genomics and which was not widely cited. Tukey's buried treasure used so-called K functions, which are set-level statistics whose expected value equals the same statistic computed on the whole population. Subsequently I learned that R.A. Fisher had derived this variance even earlier; see also Cho et al. (2005). In any case, I was glad to have gained some insight from Tukey's general framework.

36.1.2 Bootstrapping and rank statistics

Researchers were actively probing the limits of bootstrap theory when I began my statistics career. A case of interest concerned generalized bootstrap means. From a real-valued random sample X_1,...,X_n, one studied the conditional distribution of the randomized statistic

\bar{X}_n^W = \frac{1}{n} \sum_{i=1}^n W_{n,i} X_i,

conditional on the data X_i, and where the random weights W_{n,i} were generated by the statistician to enable the conditional distribution of \bar{X}_n^W to approximate the marginal sampling distribution of \bar{X}_n. Efron's bootstrap corresponds to weights having a certain multinomial distribution, but indications were that useful approximations were available beyond the multinomial.
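As a small illustration of the weighted-bootstrap setup (not of the Mason–Newton theory itself), the sketch below compares Efron's multinomial weights with one convenient family of exchangeable weights, the Dirichlet weights of the Bayesian bootstrap; the data and all constants are simulated and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x = rng.standard_normal(n) + 1.0          # one observed sample, held fixed

B = 5000
# Efron's bootstrap: multinomial weights with n draws and probabilities 1/n
w_multi = rng.multinomial(n, np.ones(n) / n, size=B)          # each row sums to n
means_multi = (w_multi * x).sum(axis=1) / n

# An exchangeable alternative: Dirichlet(1,...,1) weights rescaled to sum to n
w_dir = n * rng.dirichlet(np.ones(n), size=B)
means_dir = (w_dir * x).sum(axis=1) / n

print("sd of weighted mean, multinomial weights:", means_multi.std(ddof=1))
print("sd of weighted mean, Dirichlet weights:  ", means_dir.std(ddof=1))
print("classical estimate s/sqrt(n):            ", x.std(ddof=1) / np.sqrt(n))
```

Both randomization distributions approximate the sampling distribution of the sample mean; the exchangeability of the Dirichlet-type weights is exactly the property exploited in the permutation argument described below.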


FIGURE 36.1
The relative positions of the m = 99 cell-cycle genes (KEGG 04110) (right) are shown in the context of all measured genes (left) when genes are sorted by log fold change between HPV+ and HPV− tumors (vertical axis). Widths in the red violin plot indicate the empirical density. KEGG 04110 had a higher standardized sample variance than any functional category in GO or KEGG. Based on this high variance, further experiments were performed on the 10 named genes (right), leading to a new hypothesis about how the HPV virus deregulates the control of the cell cycle, and to biomarkers for HPV-associated cancer.
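For readers who want to see the standardization of Section 36.1.1 in action, here is a Monte Carlo sketch with simulated log fold changes (purely illustrative; the actual analysis used the Pyeon et al. (2007) data and closed-form moments from Tukey's K functions rather than simulation).

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical gene-level log fold changes for a genome of G genes (illustrative only)
G, m = 20000, 99
x = rng.standard_normal(G) * 0.5

# Suppose the first m genes play the role of the functional category c
u_c = x[:m].var(ddof=1)                     # between-gene, within-set sample variance

# Reference distribution: u(x, C) for random size-m subsets C drawn without replacement
n_sets = 5000
u_random = np.array([x[rng.choice(G, size=m, replace=False)].var(ddof=1)
                     for _ in range(n_sets)])

z = (u_c - u_random.mean()) / u_random.std(ddof=1)   # standardized set-level statistic
print("Monte Carlo E[u(x, C)]:         ", u_random.mean())
print("Population variance (ddof = 1): ", x.var(ddof=1))
print("Standardized set variance z:    ", z)
```

The printed comparison illustrates the finite-population fact used above: the sample variance of a without-replacement sample is unbiased for the population variance, so it is only the variance of u(x, C) that requires the heavier machinery of Tukey's framework.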


In a most rewarding collaboration, David Mason and I tackled the case where the W_{n,i} were exchangeable, making the seemingly superfluous observation that \bar{X}_n^W must have the same conditional distribution, given data X_i, as the additionally randomized

T_n = \frac{1}{n} \sum_{i=1}^n W_{n,\pi_{n,i}} X_i,

where, for each n, (\pi_{n,1},...,\pi_{n,n}) is a uniform random permutation of the integers 1,...,n. While the usual bootstrap statistic has two sources of randomness (one from the data and one from the bootstrap weights), this T_n had yet a third source, generated neither by nature nor by the statistician, but just imagined owing to the exchangeability of the weights. Having all three sources allowed us to condition on both the data X_i and the statistician-generated weights W_{n,i}, and still have some randomness in T_n.

A quite unconnected and somewhat amazing treasure from the theory of linear rank statistics now became relevant. Given two triangular arrays of constants, a_{n,i} and b_{n,i}, the randomized mean

S_n = \sum_{i=1}^n a_{n,\pi_{n,i}} b_{n,i}

had been studied extensively in nonparametric testing, because this is the form of the linear rank statistic. Hájek (1961) presented weak conditions on the triangular arrays such that S_n is asymptotically normal, owing to the random shuffling caused by the \pi_{n,i}. Thus, reconsidering Hájek's result in the new bootstrap context was the key to making progress on the weighted bootstrap problem (Mason and Newton, 1992).

36.1.3 Cancer genetics and stochastic geometry

A tumor is monoclonal in origin if all its cells trace by descent to a single initiated cell that is aberrant relative to the surrounding normal tissue (e.g., incurs some critical genetic mutation). Tumors are well known to exhibit internal heterogeneity, but this does not preclude monoclonal origin, since mutation, clonal expansion, and selection are dynamic evolutionary processes occurring within a tumor that move the single initiated cell to a heterogeneous collection of descendants. Monoclonal origin is the accepted hypothesis for most cancers, but evidence is mounting that tumors may initiate through some form of molecular interaction between distinct clones. As advanced as biotechnology has become, the cellular events at the point of tumor initiation remain beyond our ability to observe directly, and so the question of monoclonal versus polyclonal origin has been difficult to resolve. I have been fortunate to work on the question in the context of intestinal cancer, in a series of projects with W.F. Dove, A. Thliveris, and R. Halberg.


When measured at several months of age, intestinal tracts from mice used in the experiments were dotted with tumors. By some rather elaborate experimental techniques, cell lineages could be marked by one of two colors: some tumors were pure in color, as one would expect under monoclonal origin, yet some contained cells of both colors, and were thus overtly polyclonal. The presence of such polyclonal tumors did not raise alarm bells, since it was possible that separate tumors were forming in close proximity, and that they had merged into a single tumor mass by the time of observation. If so, the polyclonality was merely a consequence of random collision of independently initiated clones, and did not represent a mechanistically important phenomenon. The investigators suspected, however, that the frequency of these overtly polyclonal (heterotypic) tumors was too high to be explained by random collision, especially considering the tumor size, the overall tumor frequency, and the lineage marker patterns. It may have been, and subsequent evidence has confirmed, that cellular interactions are critical in the initial stages of tumor development. The statistical task at hand was to assess available data in terms of evidence against the random collision hypothesis.

In modeling data on frequencies of various tumor types, it became necessary to calculate the expected number of monoclonal tumors, biclonal tumors, and triclonal tumors when initiation events occur randomly on the intestinal surface. This is a problem in stochastic geometry, as clones will collide if they are sufficiently close. Like in the gene-set-variance problem, I tackled the expected value using first principles and with hopes that a simple approximation might emerge. The monoclonal and biclonal expectations were not so hard, but the triclonal calculation gave me fits. And then I found Armitage (1949). In a problem on the overlap of dust particles on a sampling plate, Armitage had faced the same expected value calculation and had provided a rather thorough solution, with error bounds. If N particles land at random in a region of area A, and if they clump when they lie within δ units, then the expected numbers of singletons, clumps-of-two, and clumps-of-three particles are approximately

\mu_1 = N e^{-4\psi}, \qquad \mu_2 = 2N\left(\psi - \frac{4\pi + 3\sqrt{3}}{\pi}\,\psi^2\right), \qquad \mu_3 = \frac{4(2\pi + 3\sqrt{3})}{3\pi}\,N\psi^2,

where ψ = Nπδ²/(4A). Fortunately, I could use the framework of stochastic geometry to link the quite different contexts (particle counting and tumor formation) and identify a path to testing the random collision hypothesis (Newton et al., 2006). The biological consequences continue to be investigated.

36.2 Concluding remarks

I have found great utility in beautiful statistical findings that have been relatively uncelebrated by the field and that were developed in response to problems different than I was facing. I expect there are many such buried treasures, and I encourage statisticians to seek them out even as they push forward addressing all kinds of new statistical problems. Perhaps there is very little to what I'm saying. Had I been more prepared when launching into any of the three cases above, I might have known right away how to use the available statistical results. But this seems like a lot to ask; our training programs are bursting with course work and cannot be expected to explain all of the discipline's treasures. You might also argue that the great thing about statistics and mathematics is that a single formalism works equally in all kinds of different contexts; my case studies do no more than express how the formalism is not dependent upon context. Perhaps my point is more that we must continue to exercise this formalism, continue to find analogies between distinct problems, and continue to support and develop tools that make these connections easier to identify.

Thank goodness for archiving efforts like JSTOR and the modern search engines that help us find these treasures. All of us can help by continuing to support efforts, like open access, aiming to minimize barriers to information flow. Authors and journals can help by making a greater effort to cite key background references and suggest links to related problems. Instructors, especially of courses in mathematical statistics, can help by emphasizing the distinct contexts that enliven each statistical fact. Grant reviewers and tenure committees can help by recognizing that innovation comes not only in conjuring up new theory and methodology but also by the thoughtful development of existing statistical ideas in new and important contexts. Finally, thanks to John Tukey, Peter Armitage, Jaroslav Hájek and others for the wonderful results they've left for us to find.

"There is more treasure in books than in all the pirate's loot on Treasure Island and, best of all, you can enjoy these riches every day of your life." — Walt Disney

References

Armitage, P. (1949). An overlap problem arising in particle counting. Biometrika, 45:501–519.

Cho, E., Cho, M.J., and Eltinge, J. (2005). The variance of the sample variance from a finite population. International Journal of Pure and Applied Mathematics, 21:389–396.

Hájek, J. (1961). Some extensions of the Wald–Wolfowitz–Noether theorem. The Annals of Mathematical Statistics, 32:506–523.


Mason, D.M. and Newton, M.A. (1992). A rank statistics approach to the consistency of a general bootstrap. The Annals of Statistics, 20:1611–1624.

Newton, M.A., Clipson, L., Thliveris, A.T., and Halberg, R.B. (2006). A statistical test of the hypothesis that polyclonal intestinal tumors arise by random collision of initiated clones. Biometrics, 62:721–727.

Pyeon, D., Lambert, P.F., Newton, M.A., and Ahlquist, P.G. (2011). Biomarkers for human papilloma virus-associated cancer. US Patent No. 8,012,678 B2.

Pyeon, D., Newton, M.A., Lambert, P.F., den Boon, J.A., Sengupta, S., Marsit, C.J., Woodworth, C.D., Connor, J.P., Haugen, T.H., Smith, E.M., Kelsey, K.T., Turek, L.P., and Ahlquist, P. (2007). Fundamental differences in cell cycle deregulation in human papillomavirus-positive and human papillomavirus-negative head/neck and cervical cancers. Cancer Research, 67:4605–4619.

Tukey, J.W. (1950). Some sampling simplified. Journal of the American Statistical Association, 45:501–519.


37
Survey sampling: Past controversies, current orthodoxy, and future paradigms

Roderick J.A. Little
Department of Biostatistics, University of Michigan, Ann Arbor, MI

37.1 Introduction

My contribution to this historic celebration of the COPSS concerns the field of survey sampling, its history and development since the seminal paper by Neyman (1934), current orthodoxy, and a possible direction for the future. Many encounter survey sampling through the dull prism of moment calculations, but I have always found the subject fascinating. In my first sampling course, I remember being puzzled by the different forms of weighting in regression — by the inverse of the probability of selection, or by the inverse of the residual variance (Brewer and Mellor, 1973). If they were different, which was right? My early practical exposure was at the World Fertility Survey, where I learnt some real-world statistics, and where the sampling guru was one of the giants in the field, Leslie Kish (Kish et al., 1976). Kish was proud that the developing countries in the project were more advanced than developed countries in publishing appropriate estimates of standard error that incorporated the sample design. Always engaging, he shared my love of western classical music and tolerated my model-based views. More recently, I spent time helping to set up a research directorate at the US Census Bureau, an agency that was at the forefront of advances in applied sampling under the leadership of Morris Hansen.

What distinguishes survey sampling from other branches of statistics? The genesis of the subject is a simple and remarkable idea — by taking a simple random sample from a population, reasonably reliable estimates of population quantities can be obtained with quantifiable accuracy by sampling around a thousand units, whether the population size is ten thousand or twenty million. Simple random sampling is neither optimal nor even practical in many real-world settings, and the main developments in the field concerned complex sample designs, which include features like stratification, weighting and clustering. Another important aspect is its primary focus on finite population quantities rather than parameters of models. The practical concerns of how to do probability sampling in the real world, such as the availability of sampling frames, how to exploit administrative data, and alternative modes of survey administration, are an important part of the field; valuable, since currently statistical training tends to focus on estimation and inference, neglecting designs for collecting data.

Survey sampling is notable as the one field of statistics where the prevailing philosophy is design-based inference, with models playing a supporting role. The debates leading up to this current status quo were heated and fascinating, and I offer one view of them here. I also present my interpretation of the current status quo in survey sampling, what I see as its strengths and drawbacks, and an alternative compromise between design-based and model-based inference, Calibrated Bayes, which I find more satisfying.

The winds of change can be felt in this field right now. Robert Groves, a recent Director of the US Census Bureau, wrote:

"For decades, the Census Bureau has created 'designed data' in contrast to 'organic data' [···] What has changed is that the volume of organic data produced as auxiliary to the Internet and other systems now swamps the volume of designed data. In 2004 the monthly traffic on the internet exceeded 1 exabyte or 1 billion gigabytes. The risk of confusing data with information has grown exponentially... The challenge to the Census Bureau is to discover how to combine designed data with organic data, to produce resources with the most efficient information-to-data ratio. This means we need to learn how surveys and censuses can be designed to incorporate transaction data continuously produced by the internet and other systems in useful ways. Combining data sources to produce new information not contained in any single source is the future. I suspect that the biggest payoff will lie in new combinations of designed data and organic data, not in one type alone." (Groves, 2011)

I believe that the standard design-based statistical approach of taking a random sample of the target population and weighting the results up to the population is not adequate for this task. Tying together information from traditional surveys, administrative records, and other information gleaned from cyberspace to yield cost-effective and reliable estimates requires statistical modeling. However, robust models are needed that have good repeated sampling properties.

I now discuss two major controversies in survey sampling that shaped the current state of the field.
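Before turning to these controversies, a quick calculation backs up the "around a thousand units" remark in the introduction: under simple random sampling, the standard error of an estimated proportion depends on the population size N only through the finite population correction 1 − n/N, which is negligible once N is much larger than n. The numbers below are generic, not tied to any particular survey.

```python
import math

def se_proportion(p, n, N):
    """Standard error of a sample proportion under simple random sampling
    without replacement, with the finite population correction."""
    fpc = 1.0 - n / N
    return math.sqrt(fpc * p * (1.0 - p) / n)

n, p = 1000, 0.5                      # worst-case proportion
for N in (10_000, 20_000_000):
    print(f"N = {N:>10,}: SE = {se_proportion(p, n, N):.4f}")
# Both standard errors are about 0.015-0.016, i.e., a margin of error near
# +/- 3 percentage points; the population size barely matters once N >> n.
```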


37.2 Probability or purposive sampling?

The first controversy concerns the utility of probability sampling itself. A probability sample is a sample where the selection probability of each of the samples that could be drawn is known, and each unit in the population has a non-zero chance of being selected. The basic form of probability sample is the simple random sample, where every possible sample of the chosen size n has the same chance of being selected.

When the distribution of some characteristics is known for the population, a measure of representativeness of a sample is how closely the sample distribution of these characteristics matches the population distribution. With simple random sampling, the match may not be very good, because of chance fluctuations. Thus, samplers favored methods of purposive selection where samples were chosen to match distributions of population characteristics. The precise nature of purposive selection is often unclear; one form is quota sampling, where interviewers are given a quota for each category of a characteristic (such as age group) and told to sample until that quota is met.

In a landmark early paper on sampling, Neyman (1934) addressed the question of whether the method of probability sampling or purposive selection was better. His resolution was to advocate a method that gets the best of both worlds, stratified sampling. The population is classified into strata based on values of known characteristics, and then a random sample of size n_j is taken from stratum j, of size N_j. If f_j = n_j/N_j, the sampling fraction in stratum j, is a constant, an equal probability sample is obtained where the distribution of the characteristics in the sample matches the distribution of the population. Stratified sampling was not new — see, e.g., Kaier (1897) — but Neyman expanded its practical utility by allowing f_j to vary across strata, and weighting sampled cases by 1/f_j. He proposed what is now known as Neyman allocation, which optimizes the allocations for given variances and costs of sampling within each stratum. Neyman's paper expanded the practical utility of probability sampling, and spurred the development of other complex sample designs by Mahalanobis, Hansen, Cochran, Kish and others, greatly extending the practical feasibility and utility of probability sampling in practice. For example, a simple random sample of people in a country is not feasible, since a complete list of everyone in the population from which to sample is not available. Multistage sampling is needed to implement probability sampling in this setting.

There were dissenting views — simple random sampling (or equal probability sampling in general) is an all-purpose strategy for selecting units to achieve representativeness "on average" — it can be compared with randomized treatment allocation in clinical trials. However, statisticians seek optimal properties, and random sampling is very suboptimal for some specific purposes. For example, if the distribution of X is known in the population, and the objective is the slope of the linear regression of Y on X, it's obviously much more efficient to locate half the sample at each of the extreme values of X — this minimizes the variance of the least squares slope, achieving large gains of efficiency over equal probability sampling (Royall, 1970). But this is not a probability sample — units with intermediate values of X have zero chance of selection. Sampling the extremes of X does not allow checks of linearity, and lacks robustness. Royall argues that if this is a concern, choose sample sizes at intermediate values of X, rather than letting these sizes be determined by chance. The concept of balanced sampling due to Royall and Herson (1973) achieves robustness by matching moments of X in the sample and population. Even if sampling is random within categories of X, this is not probability sampling since there is no requirement that all values of X are included. Royall's work is persuasive, but random sampling has advantages in multipurpose surveys, since optimizing for one objective often comes at the expense of others.

Arguments over the utility of probability sampling continue to this day. A recent example concerns the design of the National Children's Study (Michael and O'Muircheartaigh, 2008; Little, 2010), planned as the largest long-term study of children's health and development ever to be conducted in the US. The study plans to follow 100,000 children from before birth to early adulthood, together with their families and environment, defined broadly to include chemical, physical, behavioral, social, and cultural influences. Lively debates were waged over the relative merits of a national probability sample over a purposive sample from custom-chosen medical centers. In discussions, some still confused "probability sample" with "simple random sample." Probability sampling ideas won out, but pilot work on a probability sample of households did not produce enough births. The latest plan is a form of national probability sample based on hospitals and prenatal clinics.

An equal probability design is indicated by the all-purpose nature of the National Children's Study. However, a sample that includes high pollution sites has the potential to increase the variability of exposures, yielding more precise estimates of health effects of contaminants. A compromise with attractions is to do a combination — say, choose 80% of the sample by equal probability methods, but retain 20% of the sample to ensure coverage of areas with high contaminant exposures.

37.3 Design-based or model-based inference?

The role of probability sampling relates to ideas about estimation and inference — how we analyze the data once we have it. Neyman (1934) is widely celebrated for introducing confidence intervals as an alternative to "inverse probability" for inference from a probability sample. This laid the foundation for the "design-based approach" to survey inference, where population values are fixed and inferences are based on the randomization distribution in the selection of units... although Neyman never clearly states that he regards population values as fixed, and his references to Student's t distribution suggest that he had a distribution in mind. This leads me to the other topic of controversy, concerning design-based vs model-based inference; see, e.g., Smith (1976, 1994), Kish and Frankel (1974), Hansen et al. (1983), Kish (1995), and Chambers and Skinner (2003).

In design-based inference, population values are fixed, and inference is based on the probability distribution of sample selection. Obviously, this assumes that we have a probability sample (or "quasi-randomization," where we pretend that we have one). In model-based inference, survey variables are assumed to come from a statistical model. Probability sampling is not the basis for inference, but is valuable for making the sample selection ignorable; see Rubin (1976), Sugden and Smith (1984), and Gelman et al. (1995). There are two main variants of model-based inference: superpopulation modeling, where frequentist inference is based on repeated samples from a "superpopulation" model; and Bayesian modeling, where fixed parameters in the superpopulation model are assigned a prior distribution, and inferences about finite population quantities or parameters are based on their posterior distributions. The argument about design-based or model-based inference is a fascinating component of the broader debate about frequentist versus Bayesian inference in general: design-based inference is inherently frequentist, and the purest form of model-based inference is Bayes.

37.3.1 Design-based inference

More formally, for i ∈ {1,...,N}, let y_i be the survey (or outcome) variable of the ith unit, where N is the number of units in the population, and let Y = (y_1,...,y_N) denote the population values. The target of inference is a finite population quantity Q = Q(Y), and q̂ denotes an estimate of Q computed from the sampled values. In the design-based approach, Y is treated as fixed, and the distribution used for inference is that of the sample inclusion indicators I = (I_1,...,I_N), where I_i = 1 if unit i is sampled and I_i = 0 otherwise.

Estimators q̂ are chosen to have good design-based properties, such as

(a) design unbiasedness: E(q̂ | Y) = Q, or
(b) design consistency: q̂ → Q as the sample size gets large (Brewer, 1979; Isaki and Fuller, 1982).

It is natural to seek an estimate that is design-efficient, in the sense of having minimal variance. However, it became clear that that kind of optimality is not possible without an assumed model (Horvitz and Thompson, 1952; Godambe, 1955). Design unbiasedness tends to be too stringent, and design consistency is a weak requirement (Firth and Bennett, 1998), leading to many choices of estimates; in practice, choices are motivated by implicit models, as discussed further below. I now give some basic examples of the design-based approach.

Example 1 (Estimate of a population mean from a simple random sample): Suppose the target of inference is the population mean Q = Ȳ = (y_1 + ··· + y_N)/N and we have a simple random sample of size n, (y_1,...,y_n). The usual unbiased estimator is the sample mean q̂ = ȳ = (y_1 + ··· + y_n)/n, which has sampling variance V = (1 − n/N)S_y²/n, where S_y² is the population variance of Y. The estimated variance v̂ is obtained by replacing S_y² in V by its sample estimate s_y². A 95% confidence interval for Ȳ is ȳ ± 1.96√v̂.

Example 2 (Design weighting): Suppose the target of inference is the population total T = y_1 + ··· + y_N, and we have a sample (y_1,...,y_n) where the ith unit is selected with probability π_i, i ∈ {1,...,n}. Following Horvitz and Thompson (1952), an unbiased estimate of T is given by

\hat{t}_{HT} = \sum_{i=1}^N w_i y_i I_i,

where w_i = 1/π_i is the sampling weight for unit i, namely the inverse of the probability of selection. Estimates of variance depend on the specifics of the design.

Example 3 (Estimating a population mean from a stratified random sample): For a stratified random sample with selection probability π_j = n_j/N_j in stratum j, the Horvitz–Thompson estimator of the population mean Q = Ȳ = (y_1 + ··· + y_N)/N is the stratified mean, viz.

\bar{y}_{HT} = \frac{1}{N} \sum_{j=1}^J \sum_{i=1}^{n_j} \frac{N_j}{n_j}\, y_{ij} = \bar{y}_{st} = \sum_{j=1}^J P_j \bar{y}_j,

where P_j = N_j/N and ȳ_j is the sample mean in stratum j. The corresponding estimate of variance is

\hat{v}_{st} = \sum_{j=1}^J P_j^2 \left(1 - \frac{n_j}{N_j}\right) \frac{s_j^2}{n_j},

where s_j² is the sample variance of Y in stratum j. A corresponding 95% confidence interval for Ȳ is ȳ_st ± 1.96√v̂_st.

Example 4 (Estimating a population mean from a PPS sample): In applications such as establishment surveys or auditing, it is common to have a measure of size X available for all units in the population. Since large units often contribute more to summaries of interest, it is efficient to sample them with higher probability. In particular, for probability proportional to size (PPS) sampling, unit i with size X = x_i is sampled with probability c·x_i, where c is chosen to yield the desired sample size; units that come in with certainty are sampled and removed from the pool. Simple methods of implementation are available from lists of population units, with cumulated ranges of size. The Horvitz–Thompson estimator

\hat{t}_{HT} = c^{-1} \sum_{i=1}^N \frac{y_i}{x_i}\, I_i

is the standard estimator of the population total in this setting.

The Horvitz–Thompson estimator often works well in the context of PPS sampling, but it is dangerous to apply it to all situations. A useful guide is to ask when it yields sensible predictions of nonsampled values from a modeling perspective. A model corresponding to the HT estimator is the HT model

y_i \stackrel{ind}{\sim} N(\beta x_i, \sigma^2 x_i^2),    (37.1)

where N(µ, τ²) denotes the normal distribution with mean µ and variance τ². This leads to predictions β̂x_i, where

\hat{\beta} = n^{-1} \sum_{i=1}^N \frac{y_i}{x_i}\, I_i,

so t̂_HT = β̂(x_1 + ··· + x_N) is the result of using this model to predict the sampled and nonsampled values. If the HT model makes very little sense, the HT estimator and associated estimates of variance can perform poorly. The famous elephant example of Basu (1971) provides an extreme and comic illustration.

Models like the HT model often motivate the choice of estimator in the design-based approach. Another, more modern use of models is in model-assisted inference, where predictions from a model are adjusted to protect against model misspecification. A common choice is the generalized regression (GREG) estimator, which for a total takes the form

\hat{t}_{GREG} = \sum_{i=1}^N \hat{y}_i + \sum_{i=1}^N I_i\, \frac{y_i - \hat{y}_i}{\pi_i},

where ŷ_i are predictions from a model; see, e.g., Särndal et al. (1992). This estimator is design-consistent whether or not the model is correctly specified, and foreshadows "doubly robust" estimators in the mainline statistics literature.

37.3.2 Model-based inference

The model-based approach treats both I = (I_1,...,I_N) and Y = (y_1,...,y_N) as random variables. A model is assumed for the survey outcomes Y with underlying parameters θ, and this model is used to predict the nonsampled values in the population, and hence the finite population total. Inferences are based on the joint distribution of Y and I. Rubin (1976) and Sugden and Smith (1984) show that under probability sampling, inferences can be based on the distribution of Y alone, provided the design variables Z are conditioned on in the model, and the distribution of I given Y is independent of the distribution of Y conditional on the survey design variables. In frequentist superpopulation modeling, the parameters θ are treated as fixed; see, e.g., Valliant et al. (2000). In Bayesian survey modeling, the parameters are assigned a prior distribution, and inferences for Q(Y) are based on its posterior predictive distribution, given the sampled values; see, e.g., Ericson (1969), Binder (1982), Rubin (1987), Ghosh and Meeden (1997), Little (2004), Sedransk (2008), Fienberg (2011), and Little (2012). I now outline some Bayesian models for the examples discussed above.

Example 1 continued (Bayes inference for a population mean from a simple random sample): A basic model for simple random sampling is

y_i \mid \mu, \sigma^2 \stackrel{iid}{\sim} N(\mu, \sigma^2),

with a Jeffreys prior on the mean and variance, p(µ, log σ²) = constant. A routine application of Bayes' theorem yields a t distribution for the posterior distribution of Ȳ, with mean ȳ, scale s√{(1 − n/N)/n}, and n − 1 degrees of freedom. The 95% credibility interval is the same as the frequentist confidence interval above, except that the normal percentile, 1.96, is replaced by the t percentile, as is appropriate since the variance is estimated. Arguably this interval is superior to the normal interval even if the data are not normal, although better models might be developed for that situation.
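Here is a minimal numerical sketch of Example 1 and its Bayesian counterpart, using a simulated population in place of real survey data (the Gamma population below is purely illustrative):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical finite population and a simple random sample without replacement
N = 10_000
population = rng.gamma(shape=2.0, scale=50.0, size=N)   # illustrative values only
n = 200
sample = rng.choice(population, size=n, replace=False)

ybar = sample.mean()
s2 = sample.var(ddof=1)
v_hat = (1.0 - n / N) * s2 / n                          # estimated variance of ybar

# Design-based 95% interval (Example 1): normal percentile
lo_d, hi_d = ybar - 1.96 * np.sqrt(v_hat), ybar + 1.96 * np.sqrt(v_hat)

# Bayes under the normal model with Jeffreys prior (Example 1 continued):
# posterior for the population mean is t with n-1 df, center ybar, scale sqrt(v_hat)
t975 = stats.t.ppf(0.975, df=n - 1)
lo_b, hi_b = ybar - t975 * np.sqrt(v_hat), ybar + t975 * np.sqrt(v_hat)

print(f"True population mean:     {population.mean():8.2f}")
print(f"Design-based 95% CI:      ({lo_d:8.2f}, {hi_d:8.2f})")
print(f"Bayes (Jeffreys) 95% CrI: ({lo_b:8.2f}, {hi_b:8.2f})")
```

The two intervals differ only in the percentile used (normal versus t with n − 1 degrees of freedom), which is exactly the small-sample refinement that the design-based recipe omits.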


Example 2 continued (Bayesian approaches to design weighting): Weighting of cases by the inverse of the probability of selection is not really a model-based tool, although (as in the next example) model-based estimates correspond to design-weighted estimators for some problems. Design weights are conceived more as covariates in a prediction model, as illustrated in Example 4 below.

Example 3 continued (Estimating a population mean from a stratified random sample): For a stratified random sample, the design variables Z consist of the stratum indicators, and conditioning on Z suggests that models need to have distinct stratum parameters. Adding a subscript j for stratum to the normal model of Example 1 leads to

y_i \mid \mu_j, \sigma_j^2 \stackrel{ind}{\sim} N(\mu_j, \sigma_j^2),

with prior p(µ_j, log σ_j²) = constant. The resulting posterior distribution recovers the stratified mean as the posterior mean and the stratified variance as the posterior variance, when the variances σ_j² are assumed known. Estimating the variances leads to the posterior distribution being a mixture of t distributions. Many variants of this basic normal model are possible.

Example 4 continued (Estimating a population mean from a PPS sample): The posterior mean from the HT model (37.1) is equivalent to the HT estimator, aside from finite population corrections. Zheng and Little (2003) relax the linearity assumption of the mean structure, modeling the mean of Y given size X as a penalized spline; see also Zheng and Little (2005). Simulations suggest that this model yields estimates of the total that have superior mean squared error to the HT estimator when the HT model is misspecified. Further, posterior credible intervals from the expanded model have better confidence coverage.

37.3.3 Strengths and weaknesses

A simplified overview of the two schools of inference is that weighting is a fundamental feature of design-based methods, with models playing a secondary role in guiding the choice of estimates and providing adjustments to increase precision. Model-based inference is much more focused on predicting nonsampled (or nonresponding) units, with estimates of uncertainty. The model needs to reflect features of the design like stratification and clustering to limit the effects of model misspecification, as discussed further below. Here is my personal assessment of the strengths and weaknesses of the approaches.

The attraction of the design-based perspective is that it avoids direct dependence on a model for the population values. Models can help the choice of estimator, but the inference remains design-based, and hence somewhat nonparametric. Models introduce elements of subjectivity — all models are wrong, so can we trust the results? Design-based properties like design consistency are desirable since they apply regardless of the validity of a model. Computationally, weighting-based methods have attractions in that they can be applied uniformly to a set of outcomes, and to domain and cross-class means, whereas modeling needs more tailoring to these features.

A limitation of the design-based perspective is that inference is based on probability sampling, but true probability samples are harder and harder to come by. In the household sample setting, contact is harder — there are fewer telephone land-lines, and more barriers to telephonic contact; nonresponse is increasing, and face-to-face interviews are increasingly expensive. As Groves noted in the above-cited quote, a high proportion of available information is now not based on probability samples, but on ill-defined population frames.

Another limitation of design-based inference is that it is basically asymptotic, and provides limited tools for small samples, such as for small area estimation. The asymptotic nature leads to (in my opinion) too much emphasis on estimates and estimated standard errors, rather than on obtaining intervals with good confidence coverage. This is reflected in the absence of t corrections for estimating the variances in Examples 1 and 3 above.

On a more theoretical level, design-based inference leads to ambiguities concerning what to condition on in the "reference set" for repeated sampling. The basic issue is whether to condition on ancillary statistics — if conditioning on ancillaries is taken seriously, it leads to the likelihood principle (Birnbaum, 1962), which design-based inference violates. Without a model for predicting non-sampled cases, the likelihood is basically uninformative, so approaches that follow the likelihood principle are doomed to failure.

As noted above, design-based inference is not explicitly model-based, but attempting design-based inference without any reference to implicit models is unwise. Models are needed in the design-based approach, as in the "model-assisted" GREG estimator given above.

The strength of the model-based perspective is that it provides a flexible, unified approach for all survey problems — models can be developed for surveys that deal with frame, nonresponse and response errors, outliers, small area models, and combining information from diverse data sources. Adopting a modeling perspective moves survey sample inference closer to mainstream statistics, since other disciplines like econometrics, demography, and public health rely on statistical modeling. Bayesian modeling requires specifying priors, but has the benefit that it is not asymptotic, and can provide better small-sample inferences. Probability sampling is justified as making the sampling mechanism ignorable, improving robustness.

The disadvantage of the model-based approach is more explicit dependence on the choice of model, which has subjective elements. Survey statisticians are generally conservative, and unwilling to trust modeling assumptions, given the consequences of lack of robustness to model misspecification. Developing good models requires thought and an understanding of the data, and models have the potential for more complex computations.


37.3.4 The design-model compromise

Emerging from the debate over design-based and model-based inference is the current consensus, which I have called the design-model compromise (DMC); see Little (2012). Inference is design-based for aspects of surveys that are amenable to that approach, mainly inferences about descriptive statistics in large probability samples. These design-based approaches are often model-assisted, using methods such as regression calibration to protect against model misspecification; see, e.g., Särndal et al. (1992). For problems where the design-based approach is infeasible or yields estimates with insufficient precision, such as small area estimation or survey nonresponse, a model-based approach is adopted. The DMC approach is pragmatic, and attempts to exploit the strengths of both inferential philosophies. However, it lacks a cohesive overarching philosophy, involving a degree of "inferential schizophrenia" (Little, 2012).

I give two examples of "inferential schizophrenia." More discussion and other examples are given in Little (2012). Statistical agencies like the US Census Bureau have statistical standards that are generally written from a design-based viewpoint, but researchers from social science disciplines like economics are trained to build models. This dichotomy leads to friction when social scientists are asked to conform to a philosophy they view as alien. Social science models need to incorporate design aspects like clustering and stratification to yield robust inferences, and addressing this seems more likely to be successful from a shared modeling perspective.

Another example is that the current paradigm generally employs direct design-based estimates in large samples, and model-based estimates in small samples. Presumably there is some threshold sample size where one is design-based for larger samples and model-based for smaller samples. This leads to inconsistency, and ad hoc methods are needed to match direct and model estimates at different levels of aggregation. Estimates of precision are less easily reconciled, since confidence intervals from the model tend to be smaller than those from direct estimates because the estimates "borrow strength." Thus, it is quite possible for a confidence interval for a direct estimate to be wider than a confidence interval for a model estimate based on a smaller sample size, contradicting the notion that uncertainty decreases as information increases.

37.4 A unified framework: Calibrated Bayes

Since a comprehensive approach to survey inference requires models, a unified theory has to be model-based. I have argued (Little, 2012) that the appropriate framework is calibrated Bayes inference (Box, 1980; Rubin, 1984; Little, 2006), where inferences are Bayesian, but under models that yield inferences with good design-based properties; in other words, Bayesian credibility intervals, when assessed as confidence intervals in repeated sampling, should have close to nominal coverage. For surveys, good calibration requires that Bayes models incorporate sample design features such as weighting, stratification and clustering. Weighting and stratification are captured by including weights and stratifying variables as covariates in the prediction model; see, e.g., Gelman (2007). Clustering is captured by Bayesian hierarchical models, with clusters as random effects. Prior distributions are generally weakly informative, so that the likelihood dominates the posterior distribution.

Why do I favor Bayes over frequentist superpopulation modeling? Theoretically, Bayes has attractive properties if the model is well specified, and putting weakly informative prior distributions over parameters tends to propagate uncertainty in estimating these parameters, yielding better frequentist confidence coverage than procedures that fix parameters at their estimates. The penalized spline model in Example 4 above is one example of a calibrated Bayes approach, and others are given in Little (2012). Here is one more concluding example.

Example 5 (Calibrated Bayes modeling for stratified sampling with a size covariate): A common model for estimating the population mean of a variable Y from a simple random sample (y_1,...,y_n), with a size variable X measured for all units in the population, is the simple ratio model

y_i \mid x_i, \beta, \sigma^2 \stackrel{ind}{\sim} N(\beta x_i, \sigma^2 x_i),

for which predictions yield the ratio estimator ȳ_rat = X̄ × ȳ/x̄, where ȳ and x̄ are the sample means of Y and X and X̄ is the population mean of X. Hansen et al. (1983) suggest that this model is deficient when the sample is selected by disproportionate stratified sampling, yielding biased inferences under relatively minor deviations from the model. From a calibrated Bayes perspective, the simple ratio model does not appropriately reflect the sample design. An alternative model that does this is the separate ratio model

y_i \mid x_i, z_i = j, \beta_j, \sigma_j^2 \stackrel{ind}{\sim} N(\beta_j x_i, \sigma_j^2 x_i),

where z_i = j indicates stratum j. Predictions from this model lead to the separate ratio estimator

\bar{y}_{sep} = \sum_{j=1}^J P_j \bar{X}_j \frac{\bar{y}_j}{\bar{x}_j},

where P_j is the proportion of the population in stratum j. This estimator can be unstable if sample sizes in one or more strata are small. A Bayesian modification is to treat the slopes β_j as N(β, τ²), which smooths the estimate towards something close to the simple ratio estimate. Adding prior distributions for the variance components provides Bayesian inferences that incorporate errors from estimating the variances, and also allows smoothing of the stratum-specific variances.
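The sketch below, on simulated data, contrasts the combined and separate ratio estimators and a crude shrinkage of the stratum slopes toward a pooled value. The shrinkage weights are ad hoc stand-ins for the posterior means that the full hierarchical model with β_j ~ N(β, τ²) and priors on the variance components would produce; all populations, sample sizes and constants are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical stratified population with a size covariate X (illustrative only)
J = 5
N_j = np.array([2000, 1500, 1000, 700, 300])
beta_true = np.array([1.0, 1.2, 0.9, 1.5, 2.0])
pop_x = [rng.gamma(2.0, 10.0, size=N) for N in N_j]
pop_y = [b * x + rng.normal(0.0, 2.0 * np.sqrt(x)) for b, x in zip(beta_true, pop_x)]

P_j = N_j / N_j.sum()
Xbar_j = np.array([x.mean() for x in pop_x])

# Disproportionate stratified sample (small sample in the last stratum)
n_j = np.array([20, 20, 20, 10, 5])
samples = [rng.choice(len(x), size=n, replace=False) for x, n in zip(pop_x, n_j)]
xbar_j = np.array([pop_x[j][s].mean() for j, s in enumerate(samples)])
ybar_j = np.array([pop_y[j][s].mean() for j, s in enumerate(samples)])

# Combined (simple) and separate ratio estimators of the population mean of Y
ybar_rat = (P_j * Xbar_j).sum() * (P_j * ybar_j).sum() / (P_j * xbar_j).sum()
ybar_sep = (P_j * Xbar_j * ybar_j / xbar_j).sum()

# Crude shrinkage of stratum slopes toward their pooled value; lam is ad hoc,
# standing in for the hierarchical prior beta_j ~ N(beta, tau^2).
b_j = ybar_j / xbar_j
b_pool = (n_j * b_j).sum() / n_j.sum()
lam = n_j / (n_j + 10.0)                     # more shrinkage for small strata
ybar_shrunk = (P_j * Xbar_j * (lam * b_j + (1 - lam) * b_pool)).sum()

print("true mean:", np.concatenate(pop_y).mean())
print("combined ratio:", ybar_rat, " separate ratio:", ybar_sep,
      " shrunken separate:", ybar_shrunk)
```

With a small stratum sample (here n_j = 5 in the last stratum), the separate estimator is noisy; the shrunken estimate sits between the combined and separate estimates, which is the behavior the calibrated Bayes model formalizes, with the amount of shrinkage determined by the data rather than by an ad hoc constant.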


37.5 Conclusions

I am a strong advocate of probability sampling, which has evolved into a flexible and objective design tool. However, probability samples are increasingly hard to achieve, and the strict design-based view of survey inference is too restrictive to handle all situations. Modeling is much more flexible, but models need to be carefully considered, since poorly chosen models lead to poor inferences. The current design-model compromise is pragmatic, but lacks a coherent unifying principle. Calibrated Bayes provides a unified perspective that blends design-based and model-based ideas. I look forward to further development of this approach, leading to more general acceptance among survey practitioners. More readily accessible and general software is one area of need.

Hopefully this brief traverse of survey sampling in the last eighty years has piqued your interest. It will be interesting to see how the field of survey sampling evolves in the next eighty years of the existence of COPSS.

Acknowledgements

This work was supported as part of an Interagency Personnel Agreement with the US Census Bureau. The views expressed on statistical, methodological, technical, or operational issues are those of the author and not necessarily those of the US Census Bureau.

References

Basu, D. (1971). An essay on the logical foundations of survey sampling, part I (with discussion). In Foundations of Statistical Inference (V.P. Godambe and D.A. Sprott, Eds.). Holt, Rinehart and Winston, Toronto, pp. 203–242.

Binder, D.A. (1982). Non-parametric Bayesian models for samples from finite populations. Journal of the Royal Statistical Society, Series B, 44:388–393.

Birnbaum, A. (1962). On the foundations of statistical inference (with discussion). Journal of the American Statistical Association, 57:269–326.

Box, G.E.P. (1980). Sampling and Bayes inference in scientific modelling and robustness (with discussion). Journal of the Royal Statistical Society, Series A, 143:383–430.

Brewer, K.R.W. (1979). A class of robust sampling designs for large-scale surveys. Journal of the American Statistical Association, 74:911–915.

Brewer, K.R.W. and Mellor, R.W. (1973). The effect of sample structure on analytical surveys. Australian Journal of Statistics, 15:145–152.

Chambers, R.L. and Skinner, C.J. (2003). Analysis of Survey Data. Wiley, New York.

Cochran, W.G. (1977). Sampling Techniques, 3rd edition. Wiley, New York.

Ericson, W.A. (1969). Subjective Bayesian models in sampling finite populations (with discussion). Journal of the Royal Statistical Society, Series B, 31:195–233.

Fienberg, S.E. (2011). Bayesian models and methods in public policy and government settings. Statistical Science, 26:212–226.

Firth, D. and Bennett, K.E. (1998). Robust models in probability sampling. Journal of the Royal Statistical Society, Series B, 60:3–21.

Gelman, A. (2007). Struggles with survey weighting and regression modeling (with discussion). Statistical Science, 22:153–164.

Gelman, A., Carlin, J.B., Stern, H.S., and Rubin, D.B. (1995). Bayesian Data Analysis. Chapman & Hall, London.

Ghosh, M. and Meeden, G. (1997). Bayesian Methods for Finite Population Sampling. Chapman & Hall, London.

Godambe, V.P. (1955). A unified theory of sampling from finite populations. Journal of the Royal Statistical Society, Series B, 17:269–278.

Groves, R.M. (2011). The future of producing social and economic statistical information, Part I. Director's Blog, www.census.gov, September 8, 2011. US Census Bureau, Department of Commerce, Washington, DC.

Hansen, M.H., Madow, W.G., and Tepping, B.J. (1983). An evaluation of model-dependent and probability-sampling inferences in sample surveys (with discussion). Journal of the American Statistical Association, 78:776–793.

Horvitz, D.G. and Thompson, D.J. (1952). A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47:663–685.


Isaki, C.T. and Fuller, W.A. (1982). Survey design under the regression superpopulation model. Journal of the American Statistical Association, 77:89–96.

Kaier, A.N. (1897). The Representative Method of Statistical Surveys [1976, English translation of the original Norwegian]. Statistics Norway, Oslo, Norway.

Kish, L. (1995). The hundred years' wars of survey sampling. Statistics in Transition, 2:813–830. [Reproduced as Chapter 1 of Leslie Kish: Selected Papers (G. Kalton and S. Heeringa, Eds.). Wiley, New York, 2003.]

Kish, L. and Frankel, M.R. (1974). Inferences from complex samples (with discussion). Journal of the Royal Statistical Society, Series B, 36:1–37.

Kish, L., Groves, L.R., and Krotki, K.P. (1976). Standard errors from fertility surveys. World Fertility Survey Occasional Paper 17, International Statistical Institute, The Hague, Netherlands.

Little, R.J.A. (2004). To model or not to model? Competing modes of inference for finite population sampling. Journal of the American Statistical Association, 99:546–556.

Little, R.J.A. (2006). Calibrated Bayes: A Bayes/frequentist roadmap. The American Statistician, 60:213–223.

Little, R.J.A. (2010). Discussion of articles on the design of the National Children's Study. Statistics in Medicine, 29:1388–1390.

Little, R.J.A. (2012). Calibrated Bayes: An alternative inferential paradigm for official statistics (with discussion). Journal of Official Statistics, 28:309–372.

Michael, R.T. and O'Muircheartaigh, C.A. (2008). Design priorities and disciplinary perspectives: The case of the US National Children's Study. Journal of the Royal Statistical Society, Series A, 171:465–480.

Neyman, J. (1934). On the two different aspects of the representative method: The method of stratified sampling and the method of purposive selection. Journal of the Royal Statistical Society, 97:558–606.

Rao, J.N.K. (2011). Impact of frequentist and Bayesian methods on survey sampling practice: A selective appraisal. Statistical Science, 26:240–256.

Royall, R.M. (1970). On finite population sampling theory under certain linear regression models. Biometrika, 57:377–387.

Royall, R.M. and Herson, J.H. (1973). Robust estimation in finite populations, I and II. Journal of the American Statistical Association, 68:880–893.


428 Survey samplingRubin, D.B. (1976). Inference and missing data. Biometrika, 53:581–592.Rubin, D.B. (1984). Bayesianly justifiable and relevant frequency calculationsfor the applied statistician. The Annals of Statistics, 12:1151–1172.Rubin, D.B. (1987). Multiple Imputation for Nonresponse in Surveys. Wiley,New York.Särndal, C.-E., Swensson, B., and Wretman, J.H. (1992). Model AssistedSurvey Sampling. Springer,NewYork.Sedransk, J. (2008). Assessing the value of Bayesian methods for inferenceabout finite population quantities. Journal of Official Statistics, 24:495–506.Smith, T.M.F. (1976). The foundations of survey sampling: A review (withdiscussion). Journal of the Royal Statistical Society, Series A, 139:183–204.Smith, T.M.F. (1994). Sample surveys 1975–1990: An age of reconciliation?(with discussion). International Statistical Review, 62:5–34.Sugden, R.A. and Smith, T.M.F. (1984). Ignorable and informative designsin survey sampling inference. Biometrika, 71:495–506.Valliant, R., Dorfman, A.H., and Royall, R.M. (2000). Finite PopulationSampling and Inference: A Prediction Approach. Wiley,NewYork.Zheng, H. and Little, R.J.A. (2003). Penalized spline model-based estimationof the finite population total from probability-proportional-to-sizesamples. Journal of Official Statistics, 19:99–117.Zheng, H. and Little, R.J.A. (2005). Inference for the population total fromprobability-proportional-to-size samples based on predictions from a penalizedspline nonparametric model. Journal of Official Statistics, 21:1–20.


38
Environmental informatics: Uncertainty quantification in the environmental sciences

Noel Cressie
NIASRA, School of Mathematics and Applied Statistics
University of Wollongong, Wollongong, NSW, Australia

38.1 Introduction

This exposition of environmental informatics is an attempt to bring current thinking about uncertainty quantification to the environmental sciences. Environmental informatics is a term that I first heard being used by Bronwyn Harch of Australia's Commonwealth Scientific and Industrial Research Organisation to describe a research theme within her organization. Just as bioinformatics has grown and includes biostatistics as a sub-discipline, environmental informatics, or EI, has the potential to be much broader than classical environmental statistics; see, e.g., Barnett (2004).

Which came first, the hypothesis or the data? In EI, we start with environmental data, but we use them to reveal, quantify, and validate scientific hypotheses with a panoply of tools from statistics, mathematics, computing, and visualization.

There is a realization now in science, including the environmental sciences, that there is uncertainty in the data, the scientific models, and the parameters that govern these models. Quantifying that uncertainty can be approached in a number of ways. To some, it means smoothing the data to reveal interpretable patterns; to data miners, it often means looking for unusual data points in a sea of "big data"; and to statisticians, it means all of the above, using statistical modeling to address questions like, "Are the patterns real?" and "Unusual in relation to what?"

In the rest of this chapter, I shall develop a vision for EI around the belief that Statistics is the science of uncertainty, and that behind every good data-mining or machine-learning technique is an implied statistical model. Computing even something as simple as a sample mean and a sample variance can be linked back to the very simplest of statistical models with a location parameter and additive homoscedastic errors. The superb book by Hastie, Tibshirani, and Friedman (Hastie et al., 2009) shows the fecundity of establishing and developing such links.


EI is a young discipline, and I would like to see it develop in this modern and powerful way, with uncertainty quantification through Statistics at its core.

In what follows, I shall develop a framework that is fundamentally about environmental data and the processes that produced them. I shall be particularly concerned with big, incomplete, noisy datasets generated by processes that may be some combination of non-linear, multi-scale, non-Gaussian, multivariate, and spatio-temporal. I shall account for all the known uncertainties coherently using hierarchical statistical modeling, or HM (Berliner, 1996), which is based on a series of conditional-probability models. Finally, through loss functions that assign penalties as a function of how far away an estimate is from its estimand, I shall use a decision-theoretic framework (Berger, 1985) to give environmental policy-makers a way to make rational decisions in the presence of uncertainty, based on competing risks (i.e., probabilities).

38.2 Hierarchical statistical modeling

The building blocks of HM are the data model, the (scientific) process model, and the parameter model. If Z represents the data, Y represents the process, and θ represents the parameters (e.g., measurement-error variance and reaction-diffusion coefficients), then the data model is
[Z | Y, θ],   (38.1)
the process model is
[Y | θ],   (38.2)
and the parameter model is
[θ],   (38.3)
where [A | B, C] is generic notation for the conditional-probability distribution of the random quantity A given B and C.

A statistical approach represents the uncertainties coherently through the joint-probability distribution, [Z, Y, θ]. Using the building blocks (38.1)–(38.3), we can write
[Z, Y, θ] = [Z | Y, θ] × [Y | θ] × [θ].   (38.4)
The definition of entropy of a random quantity A is −E(ln[A]). By re-writing (38.4) as
E(ln[Z, Y, θ]) = E(ln[Z | Y, θ]) + E(ln[Y | θ]) + E(ln[θ]),
we can see that the joint entropy can be partitioned into data-model entropy, process-model entropy, and parameter-model entropy. This results in a "divide and conquer" strategy that emphasizes where scientists can put effort into understanding the sources of uncertainty and into designing scientific studies that control (and perhaps minimize some of) the entropy components.
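To make the factorization (38.4) concrete, here is a minimal numerical sketch in Python, not taken from the chapter: a toy hierarchical model in which the data model, process model, and parameter model are all Gaussian. The particular distributions and numerical values are illustrative assumptions only.

```python
import numpy as np
from scipy import stats

# A toy hierarchical model (illustrative assumptions, not the chapter's model):
#   parameter model: theta ~ N(0, 10^2)
#   process model:   Y | theta ~ N(theta, 2^2)
#   data model:      Z | Y, theta ~ N(Y, 1^2)

def log_parameter_model(theta):
    return stats.norm.logpdf(theta, loc=0.0, scale=10.0)   # ln [theta]

def log_process_model(y, theta):
    return stats.norm.logpdf(y, loc=theta, scale=2.0)      # ln [Y | theta]

def log_data_model(z, y, theta):
    return stats.norm.logpdf(z, loc=y, scale=1.0)          # ln [Z | Y, theta]

def log_joint(z, y, theta):
    # Equation (38.4): [Z, Y, theta] = [Z | Y, theta] x [Y | theta] x [theta]
    return (log_data_model(z, y, theta)
            + log_process_model(y, theta)
            + log_parameter_model(theta))

# The additive decomposition of ln[Z, Y, theta] is what allows the joint
# entropy to be partitioned into data-model, process-model, and
# parameter-model components.
z_obs, y_val, theta_val = 1.3, 0.8, 0.5
print(log_joint(z_obs, y_val, theta_val))
```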


The process Y and the parameters θ are unknown, but the data Z are known. (Nevertheless, the observed Z is still thought of as one of many possible that could have been observed, with a distribution [Z].) At the beginning of all statistical inference is a step that declares what to condition on, and I propose that EI follow the path of conditioning on what is known, namely Z. Then the conditional probability distribution of all the unknowns given Z is
[Y, θ | Z] = [Z, Y, θ]/[Z] = [Z | Y, θ] × [Y | θ] × [θ]/[Z],   (38.5)
where the first equality is known as Bayes' Theorem (Bayes, 1763); (38.5) is called the posterior distribution, and we call (38.1)–(38.3) a Bayesian hierarchical model (BHM). Notice that [Z] on the right-hand side of (38.5) is a normalizing term that ensures that the posterior distribution integrates (or sums) to 1.

There is an asymmetry associated with the role of Y and θ, since (38.2) very clearly emphasizes that [Y | θ] is where the "science" resides. It is equally true that [Y, θ] = [θ | Y] × [Y]. However, probability models for [θ | Y] and [Y] do not follow naturally from the way that uncertainties are manifested. The asymmetry emphasizes that Y is often the first priority for inference. As a consequence, we define the predictive distribution, [Y | Z], which can be obtained from (38.5) by marginalization:
[Y | Z] = ∫ [Z | Y, θ] × [Y | θ] × [θ] dθ / [Z].   (38.6)
Then inference on Y is obtained from (38.6). While (38.5) and (38.6) are conceptually straightforward, in EI we may be trying to evaluate them in global spatial or spatio-temporal settings where Z might be on the order of Gb or Tb, and Y might be of a similar order. Thus, HM requires innovative conditional-probability modeling in (38.1)–(38.3), followed by innovative statistical computing in (38.5) and (38.6). Leading cases involve spatial data (Cressie, 1993; Banerjee et al., 2004) and spatio-temporal data (Cressie and Wikle, 2011). Examples of dynamical spatio-temporal HM are given in Chapter 9 of Cressie and Wikle (2011), and we also connect the literature in data assimilation, ensemble forecasting, blind-source separation, and so forth to the HM paradigm.

38.3 Decision-making in the presence of uncertainty

Let Ŷ(Z) be one of many decisions about Y based on Z. Some decisions are better than others, which can be quantified through a (non-negative) loss function, L{Y, Ŷ(Z)}. The Bayes expected loss is E{L(Y, Ŷ)}, and we minimize this with respect to Ŷ.


Then it is a consequence of decision theory (Berger, 1985) that the optimal decision is
Y*(Z) = arg inf_Ŷ E{L(Y, Ŷ) | Z},   (38.7)
where for some generic function g, the notation E{g(Y) | Z} is used to represent the conditional expectation of g(Y) given Z.

Sometimes E{L(Y, Ŷ) | θ} is called the risk, but I shall call it the expected loss; sometimes E{L(Y, Ŷ)} is called the Bayes risk, but see above where I have called it the Bayes expected loss. In what follows, I shall reserve the word risk to be synonymous with probability.

Now, if θ were known, only Y remains unknown, and HM involves just (38.1)–(38.2). Then Bayes' Theorem yields
[Y | Z, θ] = [Z | Y, θ] × [Y | θ]/[Z | θ].   (38.8)
In this circumstance, (38.8) is both the posterior distribution and the predictive distribution; because of the special role of Y, I prefer to call it the predictive distribution. The analogue to (38.7) when θ is known is, straightforwardly,
Y*(Z) = arg inf_Ŷ E{L(Y, Ŷ) | Z, θ}.   (38.9)
Clearly, Y*(Z) in (38.9) also depends on θ.

Using the terminology of Cressie and Wikle (2011), an empirical hierarchical model (EHM) results if an estimate θ̂(Z), or θ̂ for short, is used in place of θ in (38.8): Inference on Y is then based on the empirical predictive distribution,
[Y | Z, θ̂] = [Z | Y, θ̂] × [Y | θ̂]/[Z | θ̂],   (38.10)
which means that θ̂ is also used in place of θ in (38.9).

BHM inference from (38.5) and (38.6) is coherent in the sense that it emanates from the well defined joint-probability distribution (38.4). However, the BHM requires specification of the prior [θ], and one often consumes large computing resources to obtain (38.5) and (38.6). The EHM's inference from (38.10) can be much more computationally efficient, albeit with an empirical predictive distribution that has smaller variability than the BHM's predictive distribution (Sengupta and Cressie, 2013). Bayes' Theorem applied to BHM or EHM for spatio-temporal data results in a typically very-high-dimensional predictive distribution, given by (38.6) or (38.10), respectively, whose computation requires dimension reduction and statistical-computing algorithms such as EM (McLachlan and Krishnan, 2008), MCMC (Robert and Casella, 2004), and INLA (Rue et al., 2009). For additional information on dimension reduction, see, e.g., Wikle and Cressie (1999), Wikle et al. (2001), Cressie and Johannesson (2006), Banerjee et al. (2008), Cressie and Johannesson (2008), Kang and Cressie (2011), Katzfuss and Cressie (2011), Lindgren et al. (2011), and Nguyen et al. (2012).
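As a concrete illustration of (38.7), the following Python sketch (an illustration, not part of the chapter's analysis) approximates the Bayes-optimal decision by minimizing a Monte Carlo estimate of the posterior expected loss over a grid of candidate decisions. The asymmetric loss function and the gamma "posterior" are assumptions chosen only to show that the optimum need not equal the posterior mean.

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend these are draws of a scalar Y from its predictive distribution [Y | Z]
# (in practice they would come from MCMC or another sampler).
posterior_samples = rng.gamma(shape=3.0, scale=2.0, size=10_000)

def loss(y, y_hat):
    # Asymmetric loss (illustrative assumption): under-estimation is penalized
    # three times as heavily as over-estimation.
    err = y - y_hat
    return np.where(err > 0, 3.0 * err, -err)

# Equation (38.7): choose the decision minimizing E{L(Y, y_hat) | Z},
# estimated here by a Monte Carlo average over the posterior samples.
candidates = np.linspace(posterior_samples.min(), posterior_samples.max(), 500)
expected_loss = [loss(posterior_samples, c).mean() for c in candidates]
y_star = candidates[int(np.argmin(expected_loss))]

print("posterior mean:", posterior_samples.mean())
print("Bayes-optimal decision under asymmetric loss:", y_star)
```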


In the last 20 years, methodological research in Statistics has seen a shift from mathematical statistics towards statistical computing. Deriving an analytical form for (38.6) or (38.10) is almost never possible, but being able to sample realizations from them often is. This shift in emphasis has enormous potential for EI.

For economy of exposition, I feature the BHM in the following discussion. First, if I can sample from the posterior distribution, [Y, θ | Z], I can automatically sample from the predictive distribution, [Y | Z], by simply ignoring the θ's in the posterior sample of (Y, θ). This is called a marginalization property of sampling. Now suppose there is scientific interest in a summary g(Y) of Y (e.g., regional averages, or regional extremes). Then an equivariance property of sampling implies that samples from [g(Y) | Z] are obtained by sampling from [Y | Z] and simply evaluating each member of the sample at g. This equivariance property is enormously powerful, even more so when the sampling does not require knowledge of the normalizing term [Z] in (38.5). The best known statistical computing algorithm that samples from the posterior and predictive distributions is MCMC; see, e.g., Robert and Casella (2004).

Which summary of the predictive distribution [g(Y) | Z] will be used to estimate the scientifically interesting quantity g(Y)? Too often, the posterior mean,
E{g(Y) | Z} = ∫ g(Y) [Y | Z] dY,
is chosen as a "convenient" estimator of g(Y). This is an optimal estimator when the loss function is squared-error: L{g(Y), ĝ} = {ĝ − g(Y)}²; see, e.g., Berger (1985). However, squared-error loss assumes equal consequences (i.e., loss) for under-estimation as for over-estimation. When a science or policy question is about extreme events, the squared-error loss function is strikingly inadequate, yet scientific inference based on the posterior mean is ubiquitous.

Even if squared-error loss were appropriate, it would be incorrect to compute E(Y | Z) and produce g{E(Y | Z)} as an optimal estimate, unless g is a linear functional of Y. However, this is also common in the scientific literature. Under squared-error loss, the optimal estimate is E{g(Y) | Z}, which is defined above. Notice that aggregating over parts of Y defines a linear functional g, but that taking extrema over parts of Y results in a highly non-linear functional g. Consequently, the supremum/infimum of the optimal estimate of Y (i.e., g{E(Y | Z)}) is a severe under-estimate/over-estimate of the supremum/infimum of Y, i.e., g(Y).
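The following short Python sketch (an illustration with an arbitrarily chosen Gaussian "posterior", not the chapter's analysis) demonstrates the equivariance property and the warning above: for the non-linear functional g(Y) = max(Y), evaluating g at the posterior mean severely under-estimates the posterior mean of g(Y).

```python
import numpy as np

rng = np.random.default_rng(1)

# Pretend these are joint posterior samples of a 50-dimensional process Y
# given data Z (illustrative assumption: independent N(0, 1) components).
n_samples, dim = 5_000, 50
Y_samples = rng.normal(loc=0.0, scale=1.0, size=(n_samples, dim))

# Equivariance: samples from [g(Y) | Z] are obtained by applying g to each
# posterior sample of Y. Here g is the (highly non-linear) maximum.
g_samples = Y_samples.max(axis=1)

estimate_1 = g_samples.mean()              # E{g(Y) | Z}: optimal under squared error
estimate_2 = Y_samples.mean(axis=0).max()  # g{E(Y | Z)}: the common but incorrect shortcut

print("E{max(Y) | Z} ~", round(estimate_1, 3))   # about 2.2 for 50 iid N(0,1) components
print("max{E(Y | Z)} ~", round(estimate_2, 3))   # close to 0: a severe under-estimate
```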


38.4 Smoothing the data

EI is fundamentally linked to environmental data and the questions that resulted in their collection. Questions are asked of the scientific process Y, and the data Z paint an imperfect and incomplete picture of Y. Often, the first tool that comes to a scientist's hand is a "data smoother," which here I shall call f. Suppose one defines
Ỹ ≡ f(Z);   (38.11)
notice that f "de-noises" (i.e., filters out highly variable components) and "fills in" where there are missing data. The scientist might be tempted to think of Ỹ as data coming directly from the process model, [Y | θ], and use classical statistical likelihoods based on [Y = Ỹ | θ] to fit θ and hence the model [Y | θ]. But this paradigm is fundamentally incorrect; science should incorporate uncertainty using a different paradigm. Instead of (38.11), suppose I write
Z̃ ≡ f(Z).   (38.12)
While the difference between (38.11) and (38.12) seems simply notational, conceptually it is huge.

The smoothed data Z̃ should be modelled according to [Z̃ | Y, θ], and the process Y can be incorporated into an HM through [Y | θ]. Scientific inference then proceeds from [Y | Z̃] in a BHM according to (38.6) or from [Y | Z̃, θ̂] in an EHM according to (38.10). The definition given by (38.12) concentrates our attention on the role of data, processes, and parameters in an HM paradigm and, as a consequence, it puts uncertainty quantification on firm inferential foundations (Cressie and Wikle, 2011, Chapter 2).

Classical frequentist inference could also be implemented through a marginal model (i.e., the likelihood),
[Z̃ | θ] = ∫ [Z̃ | Y, θ] × [Y | θ] dY,
although this fact is often forgotten when likelihoods are formulated. As a consequence, these marginal models can be poorly formulated or unnecessarily complicated when they do not recognize the role of Y in the probability modelling.
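As a small numerical check of the marginal-model identity above, here is a Python sketch (all distributions and values are illustrative assumptions) in which the data model and the process model are both Gaussian, so that [Z̃ | θ] can be obtained in closed form and compared with a Monte Carlo integration over Y.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Illustrative assumptions:
#   process model:  Y | theta ~ N(theta, tau^2)
#   data model:     Z_tilde | Y, theta ~ N(Y, sigma^2)
theta, tau, sigma = 1.5, 2.0, 0.7
z_tilde = 2.3                        # one smoothed-data value

# Monte Carlo integration of [Z_tilde | theta] = integral of [Z_tilde | Y, theta][Y | theta] dY
y_draws = rng.normal(loc=theta, scale=tau, size=200_000)
mc_marginal = stats.norm.pdf(z_tilde, loc=y_draws, scale=sigma).mean()

# Closed form for this Gaussian case: Z_tilde | theta ~ N(theta, sigma^2 + tau^2)
exact_marginal = stats.norm.pdf(z_tilde, loc=theta, scale=np.sqrt(sigma**2 + tau**2))

print("Monte Carlo:", round(mc_marginal, 5))
print("Exact      :", round(exact_marginal, 5))
```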


38.5 EI for spatio-temporal data

This section of the chapter gives two examples from the environmental sciences to demonstrate the power of the statistical-modelling approach to uncertainty quantification in EI.

38.5.1 Satellite remote sensing

Satellite remote sensing instruments are remarkable in terms of their optical precision and their ability to deliver measurements under extreme conditions. Once the satellite has reached orbit, the instrument must function in a near vacuum with low-power requirements, sensing reflected light (in the case of a passive sensor) through a highly variable atmospheric column.

The specific example I shall discuss here is that of remote sensing of atmospheric CO2, a greenhouse gas whose increase is having, and will have, a large effect on climate change. The global carbon cycle describes where carbon is stored and the movement of carbon between these reservoirs. The oceans and vegetation/soil are examples of CO2 sinks, and fires and anthropogenic emissions are examples of CO2 sources; of the approximately 8 Gt per year that enters the atmosphere, about half is anthropogenic. About 4 Gt stays in the atmosphere and the other 4 Gt is absorbed roughly equally by the oceans and terrestrial processes. This global increase of approximately 4 Gt of atmospheric CO2 per year is unsustainable in the long term.

It is of paramount importance to be able to characterize precisely where and when sinks (and sources) occur. Because of a lack of globally extensive, extremely precise, and very densely sampled CO2 data, these are largely unknown. Once the spatial and temporal variability of the carbon cycle is understood, regional climate projections can be made, and rational mitigation/adaptation policies can be implemented.

Although the atmosphere mixes rapidly (compared to the oceans), there is a lot of spatial variability as a function of both surface location and (geo-potential) height. There is also a lot of temporal variability at any given location, as is clear from the US National Oceanic and Atmospheric Administration's CO2 daily measurements from their Mauna Loa (Hawaii) observatory. Hence, we define atmospheric CO2 as a spatio-temporal process, Y(s; t), at spatial co-ordinates s and time t. Here, s consists of (lon, lat) = (x, y) and geo-potential height h, that belongs to the spatial domain D_s, the extent of the atmosphere around Earth; and t belongs to a temporal domain D_t (e.g., t might index days in a given month).

There are several remote sensing instruments that measure atmospheric CO2 (e.g., NASA's AIRS instrument and the Japanese space agency's GOSAT instrument); to improve sensitivity to near-surface CO2, NASA built the OCO-2 instrument. (The original OCO satellite failed to reach orbit in 2009.) It allows almost pinpoint spatial-locational accuracy (the instrument's footprint is 1.1 × 2.25 km), resulting in high global data densities during any given month. However, its small footprint results in quite a long repeat-cycle of 16 days, making it harder to capture daily temporal variability at high spatial resolution. I am a member of NASA's OCO-2 Science Team that is concerned with all components of the data-information-knowledge pyramid referred to below in Section 38.6.

The physics behind the CO2 retrieval requires measurements of CO2 in the so-called strong CO2 and weak CO2 bands of the spectrum, and of O2 in the oxygen A-band (Crisp et al., 2004). The result is a data vector of radiances Z(x, y; t), where (x, y) = (lon, lat) is the spatial location on the geoid, D_g ≡ (−180°, +180°) × (−90°, +90°), of the atmospheric column from footprint to satellite; and t is the time interval (e.g., a day or a week) during which the measurements (i.e., radiances) for that column were taken, where t ranges over the period of interest D_t.


This vector is several-thousand dimensional, and there are potentially many thousands of such vectors per time interval. Hence, datasets can be very large.

The data are "noisy" due to small imperfections in the instrument, ubiquitous detector noise, and the presence of aerosols and clouds in the column. After applying quality-control flags based on aerosol, cloud, albedo conditions, some data are declared unreliable and hence "missing." The ideal is to estimate the (dry air mole fraction) CO2 amount, Y(x, y, h; t) in units of ppm, as h varies down the atmospheric column centred at (x, y), at time t. When the column is divided up into layers centred at geo-potential heights h_1, ..., h_K, we may write
Y_0(x, y; t) ≡ (Y(x, y, h_1; t), ..., Y(x, y, h_K; t))⊤,   (38.13)
as the scientific process (i.e., state) of interest. The dimension of the state vector (38.13) is 20 for OCO-2, although 40 or so additional state variables, Y_1(x, y; t), are incorporated into Y(x, y; t) ≡ (Y_0(x, y; t)⊤, Y_1(x, y; t)⊤)⊤, from which the radiative-transfer relation can be modeled as
Z(x, y; t) = F_θ{Y(x, y; t)} + ε(x, y; t).   (38.14)
In (38.14), the functional form of F_θ is known (approximately) from the physics, but typically it requires specification of parameters θ. If θ were known, (38.14) is simply the data model, [Z(x, y; t) | Y, θ], on the right-hand side of (38.4). The process model, [Y | θ], on the right-hand side of (38.4) is the joint distribution of Y ≡ {Y(x, y; t) : (x, y) ∈ D_g, t ∈ D_t}, whose individual multivariate distributions are specified by OCO-2 ATB Document (2010) to be multivariate Gaussian with mean vectors and covariance matrices calculated from forecast fields produced by the European Centre for Medium-Range Weather Forecasting (ECMWF). However, this specification of the multivariate marginal distributions does not specify spatial dependencies in the joint distribution, [Y | θ]. Furthermore, informed guesses are made for the parameters in θ. The predictive distribution is given by (38.8), but this is not computed; a summary is typically used (e.g., the predictive mode). For more details, see Crisp et al. (2012) and O'Dell et al. (2012). Validation of the estimated CO2 values is achieved through TCCON data from a globally sparse but carefully calibrated network of land-based, upward-looking CO2 monitoring sites; see, e.g., Wunch et al. (2011).

Ubiquitously in the literature on remote sensing retrievals (Rodgers, 2000), it is the predictive mode of [Y(x, y; t) | Z(x, y; t), θ] that is chosen as the optimal estimator of Y(x, y; t). The subsequent error analysis in that literature is then concerned with deriving the mean vector and the covariance matrix of this estimator assuming that F_θ is a linear function of its state variables (Connor et al., 2008). However, the atmosphere involves highly complex interactions, and the radiative transfer function is known to be highly non-linear.
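To fix ideas about the data model (38.14), here is a toy Python simulation: a non-linear forward operator F_θ maps a state vector to a radiance vector, and additive noise gives the observed Z. The forward map, dimensions, noise level, and parameter values are all invented for illustration and bear no relation to the actual OCO-2 radiative-transfer code.

```python
import numpy as np

rng = np.random.default_rng(3)

K = 20                         # CO2 layers in the state vector Y0 (as for OCO-2)
n_channels = 100               # number of radiance channels (invented)

def forward_model(y_state, theta):
    """Toy non-linear forward map F_theta: state vector -> radiance vector.

    theta = (A, b) plays the role of the 'physics' parameters; the exponential
    makes the map non-linear in the state, mimicking the non-linearity
    discussed in the text.
    """
    A, b = theta
    return np.exp(-A @ y_state / 400.0) + b

# Invented parameters and a 'true' state (ppm values around 390).
A = rng.uniform(0.0, 1.0, size=(n_channels, K))
b = rng.uniform(0.0, 0.1, size=n_channels)
theta = (A, b)
y_true = 390.0 + rng.normal(0.0, 5.0, size=K)

# Data model (38.14): Z = F_theta(Y) + epsilon, with measurement noise epsilon.
noise_sd = 0.01
z_obs = forward_model(y_true, theta) + rng.normal(0.0, noise_sd, size=n_channels)

print("first five simulated radiances:", np.round(z_obs[:5], 4))
```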


FIGURE 38.1
Locations of six GOSAT soundings where retrievals of XCO2 were obtained (between June 5, 2009 and July 26, 2009).

In Cressie and Wang (2013), we enhanced the linear approximation by including the quadratic term of a second-order Taylor-series approximation, and we calculated the non-linearity biases of retrievals of CO2 that were obtained from data collected by the GOSAT satellite at the locations in Australia shown in Figure 38.1. For the six retrievals (i.e., predictive modes), we calculated the following biases of column-averaged CO2, or XCO2 (in units of ppm): .86, 1.15, .19, 1.15, −.78, and 1.40. Biases of this order of magnitude are considered to be important, and hence a systematic error analysis of remote sensing retrievals should recognize the non-linearity in F_θ.

It is important to note here that the predictive distribution, [Y(x, y; t) | Z(x, y; t), θ], is different from the predictive distribution, [Y(x, y; t) | Z, θ], and I propose that it is the latter that we should use when computing the optimal estimate of Y(x, y; t) from (38.9). This is based on the left-hand side of (38.8), which represents the "gold standard" to which all approximations should be compared. In practice, it would be difficult to obtain the predictive distribution, [Y(x, y; t) | Z, θ], for every retrieval, so it makes sense to summarize it with its first two moments. In future research, I shall compare the linear approximations of Connor et al. (2008) to the quadratic approximations of Cressie and Wang (2013) by comparing them to the gold standard.
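The bias calculations of Cressie and Wang (2013) are beyond the scope of a few lines, but the mechanism can be seen in a deliberately simple Python sketch (not their method): for a non-linear map g of an uncertain state, a first-order (linear) treatment reports g evaluated at the mean, whereas adding the second-order Taylor term recovers most of the curvature-induced bias. The map g and the uncertainty level are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)

# Illustrative non-linear map of an uncertain scalar state
# (not a real radiative-transfer function).
g = np.exp
g_second_derivative = np.exp

mu, sd = 0.0, 0.5                      # state mean and uncertainty (invented)
y_draws = rng.normal(mu, sd, size=1_000_000)

true_mean = g(y_draws).mean()                                     # E{g(Y)} by Monte Carlo
linear_approx = g(mu)                                             # first-order (plug-in) value
quadratic_approx = g(mu) + 0.5 * g_second_derivative(mu) * sd**2  # adds the Taylor curvature term

print("E{g(Y)}            :", round(true_mean, 4))        # exact value is exp(sd^2/2), about 1.1331
print("linear (plug-in)   :", round(linear_approx, 4))    # 1.0: biased low
print("with quadratic term:", round(quadratic_approx, 4)) # 1.125: most of the bias recovered
```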


The mode should be considered to be just one possible summary of the predictive distribution; its corresponding loss function is
L(Y, Ŷ) = 0 if Y = Ŷ, and 1 if Y ≠ Ŷ;
see, e.g., Berger (1985). I shall refer to this as the 0–1 loss function. That is, should even one element of the approximately 60-dimensional estimated state vector miss its target, a fixed loss is declared, no matter how close it is to the missed target. And the same fixed loss is declared when all or some of the elements miss their targets, by a little or a lot. From this decision-theoretic point of view, the predictive mode looks to be an estimate that in this context is difficult to justify.

The next phase of the analysis considers the dry air mole fraction (in ppm) of CO2 averaged through the column from Earth's surface to the satellite, which recall is called XCO2. Let Y*_0(x, y; t) denote the predictive mode obtained from (38.8), which is the optimal estimate given by (38.9) with the 0–1 loss function. Then XCO2(x, y; t) is estimated by
X̂CO2(x, y; t) ≡ Y*_0(x, y; t)⊤ w,   (38.15)
where the weights w are given in OCO-2 ATB Document (2010). From this point of view, X̂CO2(x, y; t) is the result of applying a smoother f to the raw radiances Z(x, y; t). The set of "retrieval data" over the time period D_t are {X̂CO2(x_i, y_i; t_i) : i = 1, ..., n} given by (38.15), which we saw from (38.12) can be written as Z̃; and Y is the multivariate spatio-temporal field {Y(x, y; t) : (x, y) ∈ D_g, t ∈ D_t}, where recall that D_g is the geoid and the period of interest D_t might be a month, say.

The true column-averaged CO2 field over the globe is a function of Y, viz.
g_V(Y) ≡ {XCO2(x, y; t) : (x, y) ∈ D_g, t ∈ D_t},   (38.16)
where the subscript V signifies vertical averaging of Y through the column of atmosphere from the satellite's footprint on the Earth's surface to the satellite. Then applying the principles set out in the previous sections, we need to construct spatio-temporal probability models for [Z̃ | Y, θ] and [Y | θ], and either a prior [θ] or an estimate θ̂ of θ. This will yield the predictive distribution of Y and hence that of g_V(Y). Katzfuss and Cressie (2011, 2012) have implemented both the EHM where θ is estimated and the BHM where θ has a prior distribution, to obtain, respectively, the empirical predictive distribution and the predictive distribution of g_V(Y) based on Z̃. The necessary computational efficiency is achieved by dimension reduction using the Spatio-Temporal Random Effects (STRE) model; see, e.g., Katzfuss and Cressie (2011). Animated global maps of the predictive mean of g_V(Y) using both approaches, based on AIRS CO2 column averages, are shown in the SSES Web-Project, "Global Mapping of CO2" (see Figure 2 at www.stat.osu.edu/~sses/collab co2.html). The regional and seasonal nature of CO2 becomes obvious by looking at these maps. Uncertainty is quantified by the predictive standard deviations, and their heterogeneity (due to different atmospheric conditions and different sampling rates in different regions) is also apparent from the animated maps.
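The vertical averaging in (38.15)–(38.16) is just a weighted sum down the column. The Python fragment below (with made-up column-averaging weights and made-up predictive draws of a 20-layer CO2 profile) shows how, by the equivariance property, predictive samples of XCO2 are obtained by applying the weights to each sampled profile.

```python
import numpy as np

rng = np.random.default_rng(5)

K = 20                                   # layers in the CO2 profile Y0
w = np.full(K, 1.0 / K)                  # made-up column-averaging weights (sum to 1);
                                         # the real weights are given in the OCO-2 ATB Document

# Made-up predictive samples of the 20-layer profile at one footprint (ppm).
n_samples = 4_000
profile_samples = 392.0 + rng.normal(0.0, 2.0, size=(n_samples, K))

# Equivariance: apply the column average to every sampled profile to get
# predictive samples of XCO2 = Y0^T w at that footprint.
xco2_samples = profile_samples @ w

print("predictive mean of XCO2 (ppm):", round(xco2_samples.mean(), 2))
print("predictive s.d. of XCO2 (ppm):", round(xco2_samples.std(), 2))
```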


It is worth pointing out that the "smoothed" data, Z̃ ≡ {X̂CO2(x_i, y_i; t_i) : i = 1, ..., n}, are different from the original radiances, Z ≡ {Z(x_i, y_i; t_i) : i = 1, ..., n}. Thus, [Y | Z̃, θ] is different from [Y | Z, θ]. Basing scientific inference on the latter, which contains all the data, is to be preferred, but practical considerations and tradition mean that the information-reduced, Z̃ = f(Z), is used for problems such as flux estimation.

Since there is strong interest from the carbon-cycle-science community in regional surface fluxes, horizontal averaging should be a more interpretable summary of Y than vertical averaging. Let g_1{Y(x, y; t)} denote the surface CO2 concentration with units of mass/area. For example, this could be obtained by extrapolating the near-surface CO2 information in Y_0(x, y; t). Then define
Ȳ(x, y; t) ≡ ∫_{R(x,y)} g_1{Y(u, v; t)} du dv / ∫_{R(x,y)} du dv
and
g_H(Y) ≡ {Ȳ(x, y; t) : (x, y) ∈ D_g, t ∈ D_t},   (38.17)
where the subscript H signifies horizontal averaging, and where R(x, y) is a pre-specified spatial process of areal regions on the geoid that defines the horizontal averaging. (It should be noted that R could also be made a function of t, and indeed it probably should change with season.) For a pre-specified time increment τ, define
∆(x, y; t) ≡ {Ȳ(x, y; t + τ) − Ȳ(x, y; t)}/τ,
with units of mass/(area × time). Then the flux field is
g_F(Y) ≡ {∆(x, y; t) : (x, y) ∈ D_g, t ∈ D_t}.   (38.18)
At this juncture, it is critical that the vector of estimated CO2 in the column, namely, Y*_0(x_i, y_i; t_i), replaces X̂CO2(x_i, y_i; t_i) to define the smoothed data, Z̃. Then the data model [Z̃ | Y, θ] changes, but critically the spatio-temporal statistical model for [Y | θ] is the same as that used for vertical averaging. Recall the equivariance property that if Y is sampled from the predictive distribution (38.6) or (38.8), the corresponding samples from g_H(Y) and g_F(Y) yield their corresponding predictive distributions. The HM paradigm allows other data sources (e.g., in situ TCCON measurements, data from other remote sensing instruments) to be incorporated into Z̃ seamlessly; see, e.g., Nguyen et al. (2012).
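A schematic Python version of the horizontal averaging and flux calculation in (38.17)–(38.18) follows, using invented predictive samples of a gridded surface-CO2 field at two time points; the grid, region, and values are assumptions. The point is that the regional average and the finite-difference flux are applied to each predictive sample, so their predictive distributions come for free by equivariance.

```python
import numpy as np

rng = np.random.default_rng(6)

n_samples, n_lon, n_lat = 2_000, 36, 18          # coarse made-up grid
tau = 1.0                                        # time increment (e.g., one month)

# Invented predictive samples of the surface CO2 field g1{Y(x, y; t)}
# (units mass/area) at times t and t + tau.
field_t   = 10.0 + rng.normal(0.0, 0.5, size=(n_samples, n_lon, n_lat))
field_tau = 10.2 + rng.normal(0.0, 0.5, size=(n_samples, n_lon, n_lat))

# A region R(x, y): here simply a block of grid cells (a stand-in for, say,
# an agricultural watershed).
region = (slice(5, 12), slice(3, 9))

# Horizontal average over R for each predictive sample (in the spirit of (38.17)),
# then the finite-difference flux (in the spirit of (38.18)).
avg_t   = field_t[(slice(None),) + region].mean(axis=(1, 2))
avg_tau = field_tau[(slice(None),) + region].mean(axis=(1, 2))
flux_samples = (avg_tau - avg_t) / tau           # mass/(area x time)

print("predictive mean of regional flux:", round(flux_samples.mean(), 3))
print("predictive 5th/95th percentiles :", np.round(np.percentile(flux_samples, [5, 95]), 3))
```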


The choice of τ is at the granularity of D_t, and the choice of R depends on the question being asked and the roughness of Earth's surface relative to the question. In a classical bias-variance trade-off, one wants R(x, y) to be large enough for g_F(x, y; t) to capture the dominant variability and small enough that the flux in R(x, y) is homogeneous.

Carbon-cycle science has accounted for much of the dynamics of CO2, but the carbon budget has consistently shown there to be a missing sink (or sinks). The OCO-2 instrument, with its almost pinpoint accuracy and high sensitivity near Earth's surface, offers an unprecedented opportunity to accurately estimate the sinks. From that point of view, the parts of Y that are of interest are lower quantiles of g_F(Y), along with the (lon, lat)-regions where those quantiles occur. In Section 38.6, I argue that these queries of the process g_F(Y) can be formalized in terms of loss functions; Zhang et al. (2008) give an illustration of this for decadal temperature changes over the Americas.

This different approach to flux estimation is centrally statistical, and it is based on a spatio-temporal model for [Y | θ]. There is another approach, one that bases [Y | θ] on an atmospheric transport model to incorporate the physical movement of voxels in the atmosphere and, consequently, the physical movement of CO2; see, e.g., Houweling et al. (2004), Chevallier et al. (2007), Gourdji et al. (2008), and Lauvaux et al. (2012). Motivated by articles such as Gourdji et al. (2008), I expect that the two approaches could be combined, creating a physical-statistical model.

When [Y | θ] is different, the predictive distribution given by (38.8) is different, and clearly when L in (38.9) is different, the optimal estimate given by (38.9) is different. This opens up a whole new way of thinking about flux estimation and quantifying its uncertainty, which is something I am actively pursuing as part of the OCO-2 Science Team.

38.5.2 Regional climate change projections

Climate is not weather, the latter being something that interests us on a daily basis. Generally speaking, climate is the empirical distribution of temperature, rainfall, air pressure, and other quantities over long time scales (30 years, say). The empirical mean (i.e., average) of the distribution is one possible summary, often used for monitoring trends, although empirical quantiles and extrema may often be more relevant summaries for natural-resource management. Regional climate models (RCMs) at fine scales of resolution (20–50 km) produce these empirical distributions over 30-year time periods and can allow decision-makers to project what environmental conditions will be like 50–60 years in the future.

Output from an RCM is obtained by discretizing a series of differential equations, coding them efficiently, and running the programs on a fast computer. From that point of view, an RCM is deterministic, and there is nothing stochastic or uncertain about it. However, uncertainties in initial and boundary conditions, in forcing parameters, and in the approximate physics associated with the spatial and temporal discretizations (Fennessy and Shukla, 2000; Xue et al., 2007; Evans and Westra, 2012) allow us to introduce a probability model for the output, from which we can address competing risks (i.e., probabilities) of different projected climate scenarios.


The RCM output can certainly be summarised statistically; in particular, it can be mapped. There is a small literature on spatial statistical analyses of RCMs, particularly of output from the North American Regional Climate Change Assessment Program (NARCCAP), administered by NCAR in Boulder, Colorado; see Kaufman and Sain (2010), Salazar et al. (2011), Kang et al. (2012), and Kang and Cressie (2013). My work in this area has involved collaboration with NCAR scientists.

Kang and Cressie (2013) give a comprehensive statistical analysis of the 11,760 regions (50 × 50 km pixels) in North America, for projected average temperature change, projected out to the 30-year-averaging period, 2041–70. The technical features of our article are: it is fully Bayesian; data dimension is reduced from the order of 100,000 down to the order of 100 through a Spatial Random Effects, or SRE, model (Cressie and Johannesson, 2008); seasonal variability is featured; and consensus climate-change projections are based on more than one RCM. Suppose that the quantity of scientific interest, Y, is temperature change in degrees Celsius by the year 2070, which is modeled statistically as
Y(s) = µ(s) + S(s)⊤η + ξ(s),   s ∈ North America,   (38.19)
where µ captures large-scale trends, and the other two terms on the right-hand side of (38.19) are Gaussian processes that represent, respectively, small-scale spatially dependent random effects and fine-scale spatially independent variability. The basis functions in S include 80 multi-resolutional bisquare functions and five indicator functions that capture physical features such as elevation and proximity to water bodies. This defines [Y | θ].

Importantly, the 30-year-average temperature change obtained from the NARCCAP output (i.e., the data, Z) is modeled as the sum of a spatial process Y and a spatial error term that in fact captures spatio-temporal interaction, viz.
Z(s) = Y(s) + ε(s),   s ∈ North America,   (38.20)
where ε is a Gaussian white-noise process with a variance parameter σ²_ε. This defines [Z | Y, θ]. The target for inference is the spatial climate-change process Y, which is "hidden" behind the spatio-temporal "noisy" process Z. A prior distribution, [θ], is put on θ, and (38.6) defines the predictive distribution. Here, θ is made up of the vector of spatial-mean effects µ, cov(η), var(ξ), and σ²_ε, and the prior [θ] is specified in the Appendix of Kang and Cressie (2013). From (38.6), we can deduce the shaded zone of North America in Figure 38.2. There, with 97.5% probability calculated pixel-wise, any 50 × 50 km pixel's Y(s) that is above a 2°C sustainability threshold is shaded. Here, Y and θ together are over 100,000-dimensional, but the computational algorithms based on dimension reduction in the SRE model do not "break."
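The following Python sketch simulates from a drastically scaled-down version of the model (38.19)–(38.20) on a one-dimensional transect; the bisquare basis functions, the number of basis functions, and all variance parameters are illustrative choices, not those of Kang and Cressie (2013).

```python
import numpy as np

rng = np.random.default_rng(7)

# Locations s on a 1-D transect (a stand-in for 50 x 50 km pixel centroids).
s = np.linspace(0.0, 100.0, 400)

def bisquare(s, centre, radius):
    """Bisquare basis function: (1 - (d/r)^2)^2 for d <= r, else 0."""
    d = np.abs(s - centre)
    out = (1.0 - (d / radius) ** 2) ** 2
    out[d > radius] = 0.0
    return out

# A small multi-resolution set of basis functions (illustrative: 8 coarse + 16 fine).
centres = np.concatenate([np.linspace(0, 100, 8), np.linspace(0, 100, 16)])
radii = np.concatenate([np.full(8, 25.0), np.full(16, 10.0)])
S = np.column_stack([bisquare(s, c, r) for c, r in zip(centres, radii)])

mu = 1.5 + 0.01 * s                                   # large-scale trend mu(s)
eta = rng.normal(0.0, 0.6, size=S.shape[1])           # low-dimensional random effects
xi = rng.normal(0.0, 0.1, size=s.size)                # fine-scale, spatially independent
Y = mu + S @ eta + xi                                 # process model in the spirit of (38.19)

sigma_eps = 0.2
Z = Y + rng.normal(0.0, sigma_eps, size=s.size)       # data model in the spirit of (38.20)

print("simulated Z at the first five locations:", np.round(Z[:5], 3))
```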


FIGURE 38.2
Regions of unsustainable (> 2°C with predictive probability .975) temperature increase obtained from pixel-wise predictive distributions, [Y(s) | Z], where s ∈ North America.

Since MCMC samples are taken from a (more than) 100,000-dimensional posterior distribution, many such probability maps like Figure 38.2 can be produced. For example, there is great interest in extreme temperature changes, so let k denote a temperature-change threshold; define the spatial probability field,
Pr(Y > k | Z),   k ≥ 0.   (38.21)
As k increases in (38.21), the regions of North America that are particularly vulnerable to climate change stand out. Decision-makers can query the BHM where, for NARCCAP, the query might involve the projected temperature increase in the 50 × 50 km pixel containing Columbus, Ohio. Or, it might involve the projected temperature increase over the largely agricultural Olentangy River watershed (which contains Columbus). From (38.6), one can obtain the probabilities (i.e., risks) of various projected climate-change scenarios, which represent real knowledge when weighing up mitigation and adaptation strategies at the regional scale.

This HM approach to uncertainty quantification opens up many possibilities: Notice that the occurrence-or-not of the events referred to above can be written as
1{Y(s_C) > k}   and   1{∫_O Y(s) ds / ∫_O ds > k},
where 1 is an indicator function, s_C is the pixel containing Columbus, and O is the set of all pixels in the Olentangy River watershed. Then squared-error loss implies that E[1{Y(s_C) > k} | Z] = Pr{Y(s_C) > k | Z} given by (38.21) is optimal for estimating 1{Y(s_C) > k}. Critically, other loss functions would yield different optimal estimates, since (38.7) depends on the loss function L. A policy-maker's question translated into a loss function yields a tailored answer to that question. Quite naturally in statistical decision theory, different questions are treated differently and result in different answers.
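Computing the probability field (38.21) from MCMC output is a one-line operation per threshold, as the schematic Python fragment below shows; the posterior samples and the small pixel grid are invented stand-ins for the NARCCAP analysis.

```python
import numpy as np

rng = np.random.default_rng(8)

# Invented posterior samples of the temperature-change field Y(s) on a
# 20 x 30 pixel grid (degrees Celsius); shape (n_mcmc, n_rows, n_cols).
n_mcmc = 3_000
Y_samples = 1.8 + rng.normal(0.0, 0.6, size=(n_mcmc, 20, 30))

def exceedance_probability(samples, k):
    """Pixel-wise Pr(Y(s) > k | Z), estimated from MCMC samples (cf. (38.21))."""
    return (samples > k).mean(axis=0)

prob_map_2C = exceedance_probability(Y_samples, 2.0)

# Pixels flagged as unsustainable at the 97.5% level, as in Figure 38.2.
flagged = prob_map_2C > 0.975
print("fraction of pixels exceeding 2 degrees C with probability > .975:",
      round(flagged.mean(), 3))
```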


More critically, the average (climate change over 30 years) Y can be replaced with an extreme quantile, say the .95 quantile, which I denote here as g_(.95)(Y); this hidden process corresponds to temperature change that could cause extreme stress to agricultural production in, for example, the Hunter Valley, NSW, Australia. Such projections for farmers in the "stressed" regions would be invaluable for planning crop varieties that are more conducive to higher temperature/lower rainfall conditions. That is, I propose making inference directly on extremal processes, and decisions should be made with loss functions that are tailor-made to the typical "what if" queries made by decision-makers.

Furthermore, output could have non-Gaussian distributions; for example, quantiles of temperature or rainfall would be skewed, for which spatial generalised linear models (Diggle et al., 1998; Sengupta and Cressie, 2013) would be well suited: In this framework, (38.20) is replaced with the data model,
[Z(s) = z | Y, θ] = EF[z; E{Z(s) | Y(s), θ}],   (38.22)
which are conditionally independent for pixels s ∈ North America. In (38.22), EF denotes the one-parameter exponential family of probability distributions. Now consider a link function l that satisfies l[E{Z(s) | Y(s), θ}] = Y(s); on this transformed scale, climate change Y is modelled as the spatial process given by (38.19). In Sengupta and Cressie (2013) and Sengupta et al. (2012), we have developed spatial-statistical methodology for very large remote sensing datasets based on (38.22) that could be adapted to RCM projections. That methodology gives the predictive distribution, (38.10), which is summarised by mapping the predictive means, the predictive standard deviations (a measure of uncertainty), and the predictive extreme quantiles. Other data models could also be used in place of (38.20), such as the extreme-value distributions.

Increases in temperature generally lead to decreases in water availability, due to an increase in evaporation. By developing conditional-probability distributions of [Temperature] and [Rainfall | Temperature], we can infer the joint behaviour of [Rainfall, Temperature]. This is in contrast to the bivariate analysis in Sain et al. (2011), and it is a further example of the utility of a conditional-probability modelling approach, here embedded in a multivariate hierarchical statistical model.
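A minimal Python illustration of the exponential-family data model (38.22) with a log link follows, assuming Poisson counts purely for illustration (Sengupta and Cressie (2013) treat the exponential family more generally): the latent field Y(s) lives on the transformed scale, and the observed Z(s) are conditionally independent given Y.

```python
import numpy as np

rng = np.random.default_rng(9)

# A latent Gaussian field Y(s) on the link (log) scale at 10 pixels
# (illustrative values; in the chapter Y would follow the SRE model (38.19)).
Y = rng.normal(loc=1.0, scale=0.4, size=10)

# Log link: l(E{Z(s) | Y(s)}) = Y(s), so the conditional mean is exp(Y(s)).
conditional_mean = np.exp(Y)

# Data model in the spirit of (38.22) with a Poisson member of the exponential
# family: Z(s) | Y(s) are conditionally independent Poisson counts.
Z = rng.poisson(conditional_mean)

print("latent Y(s):        ", np.round(Y, 2))
print("conditional means:  ", np.round(conditional_mean, 2))
print("simulated counts Z: ", Z)
```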


38.6 The knowledge pyramid

The knowledge pyramid has data at its base, information at its next tier, knowledge at its third tier, and decision-making at its apex. In the presence of uncertainty, I propose that EI have at its core the following steps: convert data into information by exploring the data for structure; convert information into knowledge by modeling the variability and inferring the etiology; and prepare the knowledge for decision-makers by translating queries into loss functions. These may not be the usual squared-error and 0–1 loss functions, which are often chosen for convenience. They may be asymmetric and multivariable, to reflect society's interest in extreme environmental events. Associated with each loss function (i.e., a query) is an optimal estimator (i.e., a wise answer) based on minimising the predictive expected loss; see (38.7), where the predictive risks (i.e., probabilities) and the loss interact to yield an optimal estimator.

The societal consequences of environmental change, mitigation, and adaptation will lead to modeling of complex, multivariate processes in the social and environmental sciences. Difficult decisions by governments will involve choices between various mitigation and adaptation scenarios, and these choices can be made, based on the risks together with the losses that are built into EI's uncertainty quantification.

38.7 Conclusions

Environmental informatics has an important role to play in quantifying uncertainty in the environmental sciences and giving policy-makers tools to make societal decisions. It uses data on the world around us to answer questions about how environmental processes interact and ultimately how they affect Earth's organisms (including Homo sapiens). As is the case for bioinformatics, environmental informatics not only requires tools from statistics and mathematics, but also from computing and visualisation. Although uncertainty in measurements and scientific theories means that scientific conclusions are uncertain, a hierarchical statistical modelling approach gives a probability distribution on the set of all possibilities. Uncertainty is no reason for lack of action: Competing actions can be compared through competing Bayes expected losses.

The knowledge pyramid is a useful concept that data analysis, HM, optimal estimation, and decision theory can make concrete. Some science and policy questions are very complex, so I am advocating an HM framework to capture the uncertainties and a series of queries (i.e., loss functions) about the scientific process to determine an appropriate course of action. Thus, a major challenge is to develop rich classes of loss functions that result in wise answers to important questions.


Acknowledgements

I would like to thank Eddy Campbell for his comments on an earlier draft, Rui Wang for his help in preparing Figure 38.1, Emily Kang for her help in preparing Figure 38.2, and Andrew Holder for his help in preparing the manuscript. This research was partially supported by the NASA Program, NNH11ZDA001N–OCO2 (Science Team for the OCO-2 Mission).

References

Banerjee, S., Carlin, B.P., and Gelfand, A.E. (2004). Hierarchical Modeling and Analysis for Spatial Data. Chapman and Hall/CRC, Boca Raton, FL.

Banerjee, S., Gelfand, A.E., Finley, A.O., and Sang, H. (2008). Gaussian predictive process models for large spatial data sets. Journal of the Royal Statistical Society, Series B, 70:825–848.

Barnett, V.D. (2004). Environmental Statistics: Methods and Applications. Wiley, New York.

Bayes, T. (1763). An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53:370–418.

Berger, J.O. (1985). Statistical Decision Theory and Bayesian Analysis. Springer, New York.

Berliner, L.M. (1996). Hierarchical Bayesian time-series models. In Maximum Entropy and Bayesian Methods (K. Hanson and R. Silver, Eds.). Kluwer, Dordrecht, pp. 15–22.

Chevallier, F., Bréon, F.-M., and Rayner, P.J. (2007). Contribution of the Orbiting Carbon Observatory to the estimation of CO2 sources and sinks: Theoretical study in a variational data assimilation framework. Journal of Geophysical Research, 112, doi:10.1029/2006JD007375.

Connor, B.J., Boesch, H., Toon, G., Sen, B., Miller, C., and Crisp, D. (2008). Orbiting Carbon Observatory: Inverse method and prospective error analysis. Journal of Geophysical Research: Atmospheres, 113, doi:10.1029/2006JD008336.


Cressie, N. (1993). Statistics for Spatial Data. Wiley, New York.

Cressie, N. and Johannesson, G. (2006). Spatial prediction for massive data sets. In Australian Academy of Science Elizabeth and Frederick White Conference. Australian Academy of Science, Canberra, Australia, pp. 1–11.

Cressie, N. and Johannesson, G. (2008). Fixed rank kriging for very large spatial data sets. Journal of the Royal Statistical Society, Series B, 70:209–226.

Cressie, N. and Wang, R. (2013). Statistical properties of the state obtained by solving a nonlinear multivariate inverse problem. Applied Stochastic Models in Business and Industry, 29:424–438.

Cressie, N. and Wikle, C.K. (2011). Statistics for Spatio-Temporal Data. Wiley, Hoboken, NJ.

Crisp, D., Deutscher, N.M., Eldering, A., Griffith, D., Gunson, M., Kuze, A., Mandrake, L., McDuffie, J., Messerschmidt, J., Miller, C.E., Morino, I., Fisher, B.M., Natraj, V., Notholt, J., O'Brien, D.M., Oyafuso, F., Polonsky, I., Robinson, J., Salawitch, R., Sherlock, V., Smyth, M., Suto, H., O'Dell, C., Taylor, T.E., Thompson, D.R., Wennberg, P.O., Wunch, D., Yung, Y.L., Frankenberg, C., Basilio, R., Bosch, H., Brown, L.R., Castano, R., and Connor, B. (2012). The ACOS XCO2 retrieval algorithm — Part II: Global XCO2 data characterization. Atmospheric Measurement Techniques, 5:687–707.

Crisp, D., Jacob, D.J., Miller, C.E., O'Brien, D., Pawson, S., Randerson, J.T., Rayner, P., Salawitch, R.J., Sander, S.P., Sen, B., Stephens, G.L., Atlas, R.M., Tans, P.P., Toon, G.C., Wennberg, P.O., Wofsy, S.C., Yung, Y.L., Kuang, Z., Chudasama, B., Sprague, G., Weiss, B., Pollock, R., Bréon, F.-M., Kenyon, D., Schroll, S., Brown, L.R., Burrows, J.P., Ciais, P., Connor, B.J., Doney, S.C., and Fung, I.Y. (2004). The Orbiting Carbon Observatory (OCO) mission. Advances in Space Research, 34:700–709.

Diggle, P.J., Tawn, J.A., and Moyeed, R.A. (1998). Model-based geostatistics (with discussion). Journal of the Royal Statistical Society, Series C, 47:299–350.

Evans, J.P. and Westra, S. (2012). Investigating the mechanisms of diurnal rainfall variability using a regional climate model. Journal of Climate, 25:7232–7247.

Fennessy, M.J. and Shukla, J. (2000). Seasonal prediction over North America with a regional model nested in a global model. Journal of Climate, 13:2605–2627.


Gourdji, S.M., Mueller, K.L., Schaefer, K., and Michalak, A.M. (2008). Global monthly averaged CO2 fluxes recovered using a geostatistical inverse modeling approach: 2. Results including auxiliary environmental data. Journal of Geophysical Research: Atmospheres, 113, doi:10.1029/2007JD009733.

Hastie, T., Tibshirani, R.J., and Friedman, J.H. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd edition. Springer, New York.

Houweling, S., Bréon, F.-M., Aben, I., Rödenbeck, C., Gloor, M., Heimann, M., and Ciais, P. (2004). Inverse modeling of CO2 sources and sinks using satellite data: A synthetic inter-comparison of measurement techniques and their performance as a function of space and time. Atmospheric Chemistry and Physics, 4:523–538.

Kang, E.L. and Cressie, N. (2011). Bayesian inference for the spatial random effects model. Journal of the American Statistical Association, 106:972–983.

Kang, E.L. and Cressie, N. (2013). Bayesian hierarchical ANOVA of regional climate-change projections from NARCCAP Phase II. International Journal of Applied Earth Observation and Geoinformation, 22:3–15.

Kang, E.L., Cressie, N., and Sain, S.R. (2012). Combining outputs from the NARCCAP regional climate models using a Bayesian hierarchical model. Journal of the Royal Statistical Society, Series C, 61:291–313.

Katzfuss, M. and Cressie, N. (2011). Spatio-temporal smoothing and EM estimation for massive remote-sensing data sets. Journal of Time Series Analysis, 32:430–446.

Katzfuss, M. and Cressie, N. (2012). Bayesian hierarchical spatio-temporal smoothing for very large datasets. Environmetrics, 23:94–107.

Kaufman, C.G. and Sain, S.R. (2010). Bayesian ANOVA modeling using Gaussian process prior distributions. Bayesian Analysis, 5:123–150.

Lauvaux, T., Schuh, A.E., Uliasz, M., Richardson, S., Miles, N., Andrews, A.E., Sweeney, C., Diaz, L.I., Martins, D., Shepson, P.B., and Davis, K. (2012). Constraining the CO2 budget of the corn belt: Exploring uncertainties from the assumptions in a mesoscale inverse system. Atmospheric Chemistry and Physics, 12:337–354.

Lindgren, F., Rue, H., and Lindström, J. (2011). An explicit link between Gaussian fields and Gaussian Markov random fields: The stochastic partial differential equation approach (with discussion). Journal of the Royal Statistical Society, Series B, 73:423–498.


McLachlan, G.J. and Krishnan, T. (2008). The EM Algorithm and Extensions, 2nd edition. Wiley, New York.

Nguyen, H., Cressie, N., and Braverman, A. (2012). Spatial statistical data fusion for remote-sensing applications. Journal of the American Statistical Association, 107:1004–1018.

OCO-2 ATB Document (2010). OCO-2 level 2 full physics retrieval algorithm theoretical basis. http://disc.sci.gsfc.nasa.gov/acdisc/documentation/OCO-2_L2_FP_ATBD_v1_rev4_Nov10.pdf.

O'Dell, C.W., Fisher, B., Gunson, M., McDuffie, J., Miller, C.E., Natraj, V., Oyafuso, F., Polonsky, I., Smyth, M., Taylor, T., Toon, G., Connor, B., Wennberg, P.O., Wunch, D., Bosch, H., O'Brien, D., Frankenberg, C., Castano, R., Christi, M., Crisp, D., and Eldering, A. (2012). The ACOS CO2 retrieval algorithm — Part I: Description and validation against synthetic observations. Atmospheric Measurement Techniques, 5:99–121.

Robert, C.P. and Casella, G. (2004). Monte Carlo Statistical Methods, 2nd edition. Springer, New York.

Rodgers, C.D. (2000). Inverse Methods for Atmospheric Sounding. World Scientific Publishing, Singapore.

Rue, H., Martino, S., and Chopin, N. (2009). Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations (with discussion). Journal of the Royal Statistical Society, Series B, 71:319–392.

Sain, S.R., Furrer, R., and Cressie, N. (2011). Combining ensembles of regional climate model output via a multivariate Markov random field model. The Annals of Applied Statistics, 5:150–175.

Salazar, E.S., Finley, A., Hammerling, D., Steinsland, I., Wang, X., and Delamater, P. (2011). Comparing and blending regional climate model predictions for the American southwest. Journal of Agricultural, Biological, and Environmental Statistics, 16:586–605.

Sengupta, A. and Cressie, N. (2013). Hierarchical statistical modeling of big spatial datasets using the exponential family of distributions. Spatial Statistics, 4:14–44.

Sengupta, A., Cressie, N., Frey, R., and Kahn, B. (2012). Statistical modeling of MODIS cloud data using the spatial random effects model. In 2012 Proceedings of the Joint Statistical Meetings, American Statistical Association, Alexandria, VA, pp. 3111–3123.

Wikle, C.K. and Cressie, N. (1999). A dimension-reduced approach to space-time Kalman filtering. Biometrika, 86:815–829.


Wikle, C.K., Milliff, R.F., Nychka, D., and Berliner, L.M. (2001). Spatiotemporal hierarchical Bayesian modeling: Tropical ocean surface winds. Journal of the American Statistical Association, 96:382–397.

Wunch, D., Ahonen, P., Biraud, S.C., Castano, R., Cressie, N., Crisp, D., Deutscher, N.M., Eldering, A., Fisher, M.L., Griffith, D.W.T., Gunson, M., Wennberg, P.O., Heikkinen, P., Keppel-Aleks, G., Kyro, E., Lindenmaier, R., Macatangay, R., Mendonca, J., Messerschmidt, J., Miller, C.E., Morino, I., Notholt, J., Toon, G.C., Oyafuso, F.A., Rettinger, M., Robinson, J., Roehl, C.M., Salawitch, R.J., Sherlock, V., Strong, K., Sussmann, R., Tanaka, T., Thompson, D.R., Connor, B.J., Uchino, O., Warneke, T., Wofsy, S.C., Fisher, B., Osterman, G.B., Frankenberg, C., Mandrake, L., and O'Dell, C. (2011). A method for evaluating bias in global measurements of CO2 total columns from space. Atmospheric Chemistry and Physics, 11:12317–12337.

Xue, Y., Vasic, R., Janjic, Z., Mesinger, F., and Mitchell, K.E. (2007). Assessment of dynamic downscaling of the continental US regional climate using the Eta/SSiB regional climate model. Journal of Climate, 20:4172–4193.

Zhang, J., Craigmile, P.F., and Cressie, N. (2008). Loss function approaches to predict a spatial quantile and its exceedance region. Technometrics, 50:216–227.

39
A journey with statistical genetics

Elizabeth A. Thompson
Department of Statistics
University of Washington, Seattle, WA

With the work of R.A. Fisher and Sewall Wright, the early years of the development of methods for analysis of genetic data were closely paralleled by the broader development of methods for statistical inference. In many ways, the parallel over the last 40 years is no less striking, with genetic and genomic data providing an impetus for development of broad new areas of statistical methodology. While molecular and computational technologies have changed out of all recognition over the last 40 years, the basic questions remain the same: Where are the genes? What do they do? How do they do it? These questions continue to provide new challenges for statistical science.

39.1 Introduction

"Plus ça change, plus c'est la même chose."
– Alphonse Karr (1849)

No doubt when things work out well, past events seem opportune, but I never cease to marvel how incredibly lucky I was to enter the field of statistical genetics in 1970. Two foundational books on the theory (Crow and Kimura, 1970) and application (Cavalli-Sforza and Bodmer, 1971) of population genetics were newly published. Together with the new edition of Stern (1973), these were the bibles of my early graduate-student years. While the available data seem primitive by today's standards, the extensive updates in 1976 of the earlier work of Mourant (1954) and colleagues provided a wider view of the genetic variation among human populations than had been previously available. Computing power for academic research was also fast expanding, with the new IBM 370 series in 1970, and virtual memory capabilities soon after.


39AjourneywithstatisticalgeneticsElizabeth A. ThompsonDepartment of StatisticsUniversity of Washington, Seattle, WAWith the work of R.A. Fisher and Sewall Wright, the early years of the developmentof methods for analysis of genetic data were closely paralleled bythe broader development of methods for statistical inference. In many ways,the parallel over the last 40 years is no less striking, with genetic and genomicdata providing an impetus for development of broad new areas of statisticalmethodology. While molecular and computational technologies have changedout of all recognition over the last 40 years, the basic questions remain thesame: Where are the genes? What do they do? How do they do it? Thesequestions continue to provide new challenges for statistical science.39.1 Introduction“Plus ça change, plus c’est la même chose.”–AlphonseKarr(1849)No doubt when things work out well, past events seem opportune, butIneverceasetomarvelhowincrediblyluckyIwastoenterthefieldofstatisticalgenetics in 1970. Two foundational books on the theory (Crow andKimura, 1970) and application (Cavalli-Sforza and Bodmer, 1971) of populationgenetics were newly published. Together with the new edition of Stern(1973), these were the bibles of my early graduate-student years. While theavailable data seem primitive by today’s standards, the extensive updates in1976 of the earlier work of Mourant (1954) and colleagues provided a widerview of the genetic variation among human populations than had been previouslyavailable. Computing power for academic research was also fast expanding,with the new IBM 370 series in 1970, and virtual memory capabilitiessoon after.451


452 Statistical geneticsTABLE 39.1The changing study designs and data structures and relationship to developingstatistical approaches.Date Data Design/Structure Statistical Approach1970s pedigrees; evolutionary trees Latent variables and EM1980s genetic maps HMM methodslinkage analysisGraphical models1990s more complex traits Monte Carlo likelihood; MCMCmore complex pedigrees Complex stochastic systems2000s large scale association mapping FDR, p ≫ n2010s descent of genomes; IBD MC realization of latent structureIt is often commented, how, in an earlier era, genome science and statisticalscience developed in parallel. In 1922, Fisher started to develop likelihoodinference (Fisher, 1922a), while in the same year his first example of maximumlikelihood estimation was of estimating genetic recombination frequencies indrosophila (Fisher, 1922b). His first use of the term variance was in developingthe theory of genetic correlations among relatives (Fisher, 1918), while analysisof variance was established in Fisher (1925). In parallel, Wright (1922)was also developing the theory of the dependence structure of quantitativegenetic variation among related individuals, leading to the theory of path coefficients(Wright, 1921) and structural equation modeling (Pearl, 2000). Thechanges in genetics and genomics, statistical science, and both molecular andcomputational technologies over the last 40 years (1970–2010) are arguablymany times greater than over the preceding 40 (1930–1970), but the samecomplementary developments of statistics and genetics are as clear as thoseof Fisher and Wright; see Table 39.1. Moreover the basic scientific questionsremain the same: Where are the genes? What do they do? How do they do it?39.2 The 1970s: Likelihood inference and theEM algorithmThe basic models of genetics are fundamentally parametric. The dependencestructure of data on a pedigree dates to Elston and Stewart (1971) and isshown in Figure 39.1(a). First, population-level parameters provide the probabilitiesof genotypes G (F ) of founder members, across a small set of marker(M) or trait (T ) loci. Parameters of the process of transmission of DNA fromparents to offspring then determine the probabilities of the genotypes (G)


E.A. Thompson 453PopulationMeiosisG (F )M3G (F )TG (F )M2G (F )M1FoundersS M3S TS M2S M1Joint TransmissionG M3G TG M2G M1AllHaplotypesY M3Y TTraitPenet.Y M2Y M1MarkerPenet.TraitPenet.MarkerErrorA (F )M3 A (F )T A (F )M2 A (F )M1Y M3Y TY M2Y M1MarkerPop nTraitPop nMarkerPop nMarkerPop nFIGURE 39.1The two orthogonal conditional independence structures of genetic data on relatedindividuals. (a) Left: The conditional independence of haplotypes amongindividuals. (b) Right: The conditional independence of inheritance amonggenetic loci. Figures from Thompson (2011), reproduced with permission ofS. Karger AG, Basel.of descendant individuals. Finally, penetrance parameters specify the probabilisticrelationship between these genotypes and observable genetic data (Y )on individuals, again at each trait (T ) or marker (M) locus.Thesepenetrancemodels can incorporate typing error in marker genotypes, as well asmore complex relationships between phenotype and genotype. The structuredparametric models of statistical genetics lead naturally to likelihood inference(Edwards, 1972), and it is no accident that from Fisher (1922b) onwards, amajor focus has been the computation of likelihoods and of maximum likelihoodestimators.Methods for the computation of likelihoods on pedigree structures makeuse of the conditional independence structure of genetic data. Under the lawsof Mendelian genetics, conditional on the genotypes of parents, the genotypesof offspring are independent of each other, and of those of any ancestral andlateral relatives of the parents. Data on individuals depends only on their genotypes;see Figure 39.1(a). Methods for computation of probabilities of observeddata on more general graphical structures are now widely known (Lauritzenand Spiegelhalter, 1988), but these methods were already standard in pedigreeanalysis in the 1970s. In fact the first use of this conditional independence incomputing probabilities of observed data on three-generation pedigrees datesto Haldane and Smith (1947), while the Elston–Stewart algorithm (Elston and


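The conditional independence just described can be made concrete with a small numerical sketch. The following Python fragment computes the likelihood of observed phenotypes on a single nuclear family at one biallelic locus by conditioning on the parental genotypes, so that the children contribute independent factors. It is only an illustrative toy version of the factorization that the Elston-Stewart algorithm organizes efficiently on general pedigrees, not code from the chapter; the allele frequency, penetrance values, and family data are all invented.

```python
import itertools

# Toy pedigree-likelihood factorization: one biallelic locus, one nuclear family,
# genotypes coded by the number of copies of allele "a" (0, 1, or 2).
q = 0.01                                                 # assumed frequency of allele "a"
founder_prior = {0: (1 - q)**2, 1: 2*q*(1 - q), 2: q**2} # Hardy-Weinberg founder probabilities

# Penetrance: P(affected | genotype), e.g. a mostly recessive trait with phenocopies.
penetrance = {0: 0.01, 1: 0.01, 2: 0.90}

def pheno_prob(affected, g):
    """P(observed phenotype | genotype)."""
    p = penetrance[g]
    return p if affected else 1.0 - p

def transmission(gc, gf, gm):
    """Mendelian P(child genotype gc | father gf, mother gm)."""
    def gamete(g):                                       # P(transmit allele "a" | parent genotype)
        return {0: 0.0, 1: 0.5, 2: 1.0}[g]
    pf, pm = gamete(gf), gamete(gm)
    return {0: (1 - pf) * (1 - pm),
            1: pf * (1 - pm) + (1 - pf) * pm,
            2: pf * pm}[gc]

def nuclear_family_likelihood(father_aff, mother_aff, children_aff):
    """Sum over parental genotypes; conditional on the parents,
    each child contributes an independent factor."""
    lik = 0.0
    for gf, gm in itertools.product(range(3), repeat=2):
        term = (founder_prior[gf] * pheno_prob(father_aff, gf) *
                founder_prior[gm] * pheno_prob(mother_aff, gm))
        for child_aff in children_aff:
            term *= sum(transmission(gc, gf, gm) * pheno_prob(child_aff, gc)
                        for gc in range(3))
        lik += term
    return lik

# Example: unaffected parents with one affected and two unaffected children.
print(nuclear_family_likelihood(False, False, [True, False, False]))
```
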
Statistical genetics is fundamentally a latent variable problem. The underlying processes of descent of DNA cannot be directly observed. The observed data on individuals result from the types of the DNA they carry at certain genome locations, but these locations are often unknown. In 1977, the EM algorithm (Dempster et al., 1977) was born. In particular cases it had existed much earlier, in the gene-counting methods of Ceppelini et al. (1955), in the variance component methods of quantitative genetics (Patterson and Thompson, 1971), and in the reconstruction of human evolutionary trees (Thompson, 1975), but the EM algorithm provided a framework that unified these examples and suggested approaches to maximum likelihood estimation across a broad range of statistical genetics models.


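As a concrete illustration of the kind of calculation the EM framework unified, the following sketch implements gene counting for ABO blood-group allele frequencies, in the spirit of the method attributed above to Ceppelini et al. (1955). It is a hypothetical toy example with made-up phenotype counts, not an analysis from the chapter: the E-step splits the A and B phenotype counts into their homozygous and heterozygous parts using the current allele-frequency estimates, and the M-step is simple allele counting.

```python
# Toy EM ("gene counting") for ABO allele frequencies p (A), q (B), r (O),
# assuming Hardy-Weinberg proportions.  The phenotype counts below are invented.
n_A, n_B, n_AB, n_O = 186, 38, 13, 284
n = n_A + n_B + n_AB + n_O

p, q, r = 1/3, 1/3, 1/3                      # starting values
for _ in range(50):
    # E-step: expected genotype counts given current (p, q, r).
    # Phenotype A is a mixture of genotypes AA and AO; similarly for B.
    n_AA = n_A * p**2 / (p**2 + 2*p*r)
    n_AO = n_A - n_AA
    n_BB = n_B * q**2 / (q**2 + 2*q*r)
    n_BO = n_B - n_BB
    # M-step: count alleles (each individual carries two).
    p = (2*n_AA + n_AO + n_AB) / (2*n)
    q = (2*n_BB + n_BO + n_AB) / (2*n)
    r = (2*n_O + n_AO + n_BO) / (2*n)

print(round(p, 4), round(q, 4), round(r, 4))  # maximum likelihood allele frequencies
```
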
39.3 The 1980s: Genetic maps and hidden Markov models

The statistical methodology of human genetic linkage analysis dates back to the 1930s work of J.B.S. Haldane (1934) and R.A. Fisher (1934), and the likelihood framework for inferring and estimating linkage from human family data was established in the 1950s by the work of C.A.B. Smith (1953) and N.E. Morton (1955). However, genetic linkage findings were limited: there were no genetic marker maps.

That suddenly changed in 1980 (Botstein et al., 1980), with the arrival of the first DNA markers, the restriction fragment length polymorphisms or RFLPs. For the first time, there was the vision we now take for granted, of genetic markers available at will throughout the genome, providing the framework against which traits could be mapped. This raised new statistical questions in the measurement of linkage information (Thompson et al., 1978). The development of DNA markers progressed from RFLPs to (briefly) the multilocus variable number of tandem repeat or VNTR loci used primarily for relationship inference (Jeffreys et al., 1991; Geyer et al., 1993), and then to STR (short tandem repeat or microsatellite) loci (Murray et al., 1994); see Table 39.2. These DNA markers, mapped across the genome, brought a whole new framework of conditional independence to the computation of linkage likelihoods (Lander and Green, 1987; Abecasis et al., 2002). Rather than the conditional independence in the transmission of DNA from parents to offspring, the relevant structure became the Markov dependence of inheritance (S) of DNA at successive marker (M) or hypothesized trait (T) locations across a chromosome, as shown in Figure 39.1(b).

TABLE 39.2
The changing pattern of genetic data (1970-2010).

Date  Marker Type                  Data Structure           Trait Type
1970  Blood types                  Nuclear families         Mendelian
1980  RFLPs                        Large pedigrees          Simple traits
1990  STRs (Microsatellites)       Small pedigrees          Quantitative traits
2000  SNPs and mRNA                Case/Control             Complex traits
      expression data              ("unrelated")
2010  RNAseq and                   Relatives in             Complex
      Sequence data                populations              quantitative traits

As before, the population model provides the probabilities of the allelic types (A) of founders (F), at trait (T) or marker (M) loci. At a locus, the observable data (Y) is determined by the founder allelic types (A) and the inheritance (S) at that locus, possibly through a penetrance model in the case of trait loci.

39.4 The 1990s: MCMC and complex stochastic systems

The earlier methods (Figure 39.1(a)) are computationally exponential in the number of genetic loci analyzed jointly, while the hidden Markov model (HMM) methods (Figure 39.1(b)) are exponential in the number of meioses in a pedigree. Neither could address large numbers of loci observed on large numbers of related individuals. However, the same conditional independence structures that make possible the computation of linkage likelihoods for few markers or for small pedigrees lend themselves to Markov chain Monte Carlo (MCMC). Genetic examples, as well as other scientific areas, gave impetus to the huge burst of MCMC in the early 1990s. However, unlike other areas, where MCMC was seen as a tool for Bayesian computation (Gelfand and Smith, 1990; Besag and Green, 1993), in statistical genetics the focus on likelihood inference led rather to Monte Carlo likelihood (Penttinen, 1984; Geyer and Thompson, 1992).

The discreteness and the constraints of genetic models provided challenges for MCMC algorithms. Earlier methods (Lange and Sobel, 1991) used the genotypes of individuals as latent variables (Figure 39.1(a)) and encountered
problems of non-irreducibility of samplers (Sheehan and Thomas, 1993) and other mixing problems (Lin et al., 1993). Significant improvements were obtained by instead using the inheritance patterns of Figure 39.1(b) as latent variables (Thompson, 1994). The a priori independence of meioses, the Markov dependence of inheritance at successive loci, and the dependence of observable data on the inheritance pattern at a locus (Figure 39.2(a)) lead to a variety of block-Gibbs samplers of increasing computational efficiency (Heath, 1997; Thompson and Heath, 1999; Tong and Thompson, 2008).

[Figure 39.2 appears here: panel (a) is a lattice indexed by loci (j) and meioses (i), with data Y_{.,j} attached to each locus; panel (b) is an IBD graph on observed individuals.]

FIGURE 39.2
Imputation and specification of inheritance of DNA. (a) Left: The MCMC structure of meiosis and linkage (Thompson, 2000). (b) Right: The IBD graph specifying DNA shared by descent among observed individuals (Thompson, 2003).

An additional aspect of these later developments is the change from the use of the conditional independence structure of a pedigree to that of an IBD graph (Figure 39.2(b)). This graph specifies the individuals who share genome identical by descent (IBD) at a locus; that is, DNA that has descended to current individuals from a single copy of the DNA in a recent common ancestor. The observed trait phenotypes of individuals are represented by the edges of the IBD graph. The nodes of the graph represent the DNA carried by these individuals; the DNA types of the nodes are independent. Each individual's edge joins the two DNA nodes which he or she carries at the locus, and his or her trait phenotype is determined probabilistically by the latent allelic types of these two nodes. In the example shown in Figure 39.2(b), individuals D, G, and F all share IBD DNA at this locus, as represented by the node labeled 4. Also, individuals B and J share both their DNA nodes, while C carries two copies of a single node. Computation of probabilities of observed data on a graph such as that of Figure 39.2(b) is described by Thompson (2003). This approach gives a much closer parallel to graphical models in other areas of statistical science; see, e.g., Lauritzen (1996) and Pearl (2000).


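The IBD graph just described is easy to represent directly. The sketch below encodes an IBD graph consistent with the features of Figure 39.2(b) quoted in the text (D, G, and F sharing node 4; B and J sharing both nodes; C carrying two copies of one node), with all remaining node assignments, allele frequencies, and penetrances invented, and it computes the probability of a set of binary trait phenotypes by a brute-force sum over the latent allelic types of the nodes. It is a hypothetical illustration of the data structure, not the algorithm of Thompson (2003), which exploits the graph to avoid this exhaustive sum.

```python
import itertools

# Each individual's edge joins the two DNA nodes he or she carries at the locus.
# The pairs reproduce features quoted from Figure 39.2(b) in the text; everything
# else (node labels, frequencies, penetrances, phenotypes) is invented.
ibd_graph = {
    "B": (1, 2), "J": (1, 2),                  # B and J share both their DNA nodes
    "C": (3, 3),                               # C carries two copies of a single node
    "D": (4, 5), "G": (4, 6), "F": (4, 7),     # D, G, F share node 4
}
phenotype = {"B": 1, "J": 1, "C": 0, "D": 1, "G": 0, "F": 1}   # made-up trait data

q = 0.05                                        # frequency of risk allele "a"
node_prior = {0: 1 - q, 1: q}                   # latent allelic type of each DNA node
penetrance = {0: 0.02, 1: 0.10, 2: 0.80}        # P(affected | number of "a" alleles)

nodes = sorted({v for pair in ibd_graph.values() for v in pair})

def prob_observed_data():
    """P(phenotypes) = sum over latent node types of
       prod_nodes P(type) * prod_individuals P(phenotype | types of its two nodes)."""
    total = 0.0
    for types in itertools.product((0, 1), repeat=len(nodes)):
        t = dict(zip(nodes, types))
        pr = 1.0
        for v in nodes:
            pr *= node_prior[t[v]]
        for ind, (u, v) in ibd_graph.items():
            p_aff = penetrance[t[u] + t[v]]
            pr *= p_aff if phenotype[ind] == 1 else 1.0 - p_aff
        total += pr
    return total

print(prob_observed_data())
```
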
39.5 The 2000s: Association studies and gene expression

STR markers are highly variable, but expensive to type, and occur relatively sparsely in the genome. The advent of single nucleotide polymorphism (SNP) markers in essentially unlimited numbers (International HapMap Consortium, 2005) brought a new dimension and new issues to the genetic mapping of complex traits. Genome-wide association studies (GWAS) became highly popular due to their expected ability to locate causal genes without the need for pedigree data. However, early GWAS were underpowered. Only with large-scale studies (Wellcome Trust, 2007) and better methods to control for population structure and heterogeneity (Price et al., 2006) did association methods start to have success. Modern GWAS typically consider a few thousand individuals, each typed for up to one million SNPs. New molecular technologies also provided new measures of gene expression variation based on the abundance of mRNA transcripts (Schena et al., 1995). Again the statistical question is one of association of a trait or sample phenotype with the expression of some small subset of many thousands of genes.

The need to make valid statistical inferences from both GWAS and from gene expression studies prompted the development of new general statistical approaches. Intrinsic to these problems is that the null hypothesis may be false in many (albeit a small fraction) of the tests of significance made. This leads to a focus on false discovery rates rather than p-values (Storey, 2002, 2003). Both GWAS and gene expression studies also exhibit the modern phenomenon of high-dimensional data (p >> n), or very large numbers of observations on relatively few subjects (Cai and Shen, 2010), giving scope for new methods for dimension reduction and inducing sparsity (Tibshirani et al., 2005).

Genomic technologies continue to develop, with cDNA sequence data replacing SNPs (Mardis, 2008) and RNAseq data (Shendure, 2008) replacing the more traditional microarray expression data. Both raise new statistical challenges. The opportunities for new statistical modeling and inference are immense:

"... next-generation [sequencing] platforms are helping to open entirely new areas of biological inquiry, including the investigation of ancient genomes, the characterization of ecological diversity, and the identification of unknown etiologic agents." (Mardis, 2008)

But so also are the challenges:

"Although these new [RNAseq] technologies may improve the quality of transcriptome profiling, we will continue to face what has probably been the larger challenge with microarrays — how best to generate biologically meaningful interpretations of complex datasets that are sufficiently interesting to drive follow-up experimentation." (Shendure, 2008)


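The shift from per-test p-values to false discovery rates mentioned earlier in this section can be illustrated with the Benjamini-Hochberg step-up rule, a simpler relative of the q-value approach of Storey (2002, 2003). This is a generic sketch on simulated p-values, not an analysis from the chapter; the numbers of tests and the alternative distribution are invented.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate p-values for many tests: most nulls are true, a small fraction are not.
m, m_alt = 10000, 300
p_null = rng.uniform(size=m - m_alt)
p_alt = rng.beta(0.5, 20.0, size=m_alt)        # alternatives give small p-values
pvals = np.concatenate([p_null, p_alt])

def benjamini_hochberg(p, alpha=0.05):
    """Return a boolean array of rejections controlling the FDR at level alpha."""
    m = len(p)
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, m + 1) / m
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])       # largest index meeting its threshold
        reject[order[:k + 1]] = True
    return reject

rejected = benjamini_hochberg(pvals, alpha=0.05)
print(rejected.sum(), "discoveries at FDR 0.05;",
      (pvals < 0.05).sum(), "tests with p < 0.05")
```
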
39.6 The 2010s: From association to relatedness

For several reasons there is a move back from population studies to a consideration of related individuals. As the sizes of case-control GWAS grow, problems of population structure increase (Price et al., 2006). Further, these samples often contain related individuals. Closely related individuals may be discarded, but large numbers of distant relatives also impact results. In addition to other heterogeneity between cases and controls, the relationship structure of the case sample may differ from that of the controls. Secondly, GWAS are predicated on the "common disease, common variant" model, but there is growing recognition of the role of rare variants in many diseases (Cohen et al., 2004). There are many different mutations that can affect the function of any given gene, and many different genes that function jointly in gene networks. While association tests for rare variant effects in GWAS designs have been developed (Madsen and Browning, 2009), the use of inferred shared descent can provide a more powerful approach (Browning and Thompson, 2012).

Not only does using family information in conjunction with association testing provide more power (Thornton and McPeek, 2007, 2010), but, using modern SNP data, genome shared IBD (Section 39.4) can be detected among individuals not known to be related (Brown et al., 2012; Browning and Browning, 2012). Once IBD in a given region of the genome is inferred from genetic marker data, whether using a known pedigree or from population data, its source is irrelevant. The IBD graph (Figure 39.2(b)) summarizes all the relevant information for the analysis of trait data on the observed individuals. The use of inferred IBD, or more generally of estimated relatedness (Lee et al., 2011), is becoming the approach of choice in many areas of genetic analysis.

39.7 To the future

Computational and molecular technologies change ever faster, and the relevant probability models and statistical methodologies will likewise change. For the researchers of the future, more important than any specific knowledge is the approach to research. As has been said by the statistical philosopher Ian Hacking:

"Statisticians change the world not by new methods and techniques but by ways of thinking and reasoning under uncertainty."


I have given many lectures in diverse academic settings, and received many generous and kind introductions, but one which I treasure was given at a recent seminar visit to a Department of Statistics. I was introduced by one of my former PhD students, who said that I had taught him three things:

Think science: Think positive: Think why.

Think science: For me, the scientific questions motivate the statistical thinking. Although, as a student taking exams, I did better with the clearly defined world of mathematical proofs than with the uncertainties of statistical thinking, I could never have become a research mathematician. Answering exam questions was easy; knowing what questions to ask was for me impossible. Fortunately, genetic science came to my rescue: there the questions are endless and fascinating.

Think positive: Another of my former students has said that he got through his (excellent) PhD work because, whenever he came to me in despair that his results were not working out, my response was always "But that's really interesting". Indeed, many things in research do not work out the way we expect, and often we learn far more from what does not work than from what does.

Think why: And when it does not work (or even when it does), the first and most important question is "Why?" (Thompson, 2004). If there is anything that distinguishes the human species from other organisms, it is not directly in our DNA, but in our capacity to ask "Why?". In research, at least, this is the all-important question.

Think science: Think positive: Think why.

If my students have learned this from me, this is far more important to their futures and to their students' futures than any technical knowledge I could have provided.

References

Abecasis, G.R., Cherny, S.S., Cookson, W.O., and Cardon, L.R. (2002). Merlin — rapid analysis of dense genetic maps using sparse gene flow trees. Nature Genetics, 30:97-101.

Besag, J.E. and Green, P.J. (1993). Spatial statistics and Bayesian computation. Journal of the Royal Statistical Society, Series B, 55:25-37.

Botstein, D., White, R.L., Skolnick, M.H., and Davis, R.W. (1980). Construction of a linkage map in man using restriction fragment polymorphism. American Journal of Human Genetics, 32:314-331.


Brown, M.D., Glazner, C.G., Zheng, C., and Thompson, E.A. (2012). Inferring coancestry in population samples in the presence of linkage disequilibrium. Genetics, 190:1447-1460.

Browning, S.R. and Browning, B.L. (2012). Identity by descent between distant relatives: Detection and applications. Annual Review of Genetics, 46:617-633.

Browning, S.R. and Thompson, E.A. (2012). Detecting rare variant associations by identity by descent mapping in case-control studies. Genetics, 190:1521-1531.

Cai, T. and Shen, X., Eds. (2010). High-Dimensional Data Analysis, vol. 2 of Frontiers of Science. World Scientific, Singapore.

Cannings, C., Thompson, E.A., and Skolnick, M.H. (1978). Probability functions on complex pedigrees. Advances in Applied Probability, 10:26-61.

Cannings, C., Thompson, E.A., and Skolnick, M.H. (1980). Pedigree analysis of complex models. In Current Developments in Anthropological Genetics, J. Mielke and M. Crawford, Eds. Plenum Press, New York, pp. 251-298.

Cavalli-Sforza, L.L. and Bodmer, W.F. (1971). The Genetics of Human Populations. W.H. Freeman, San Francisco, CA.

Ceppelini, R., Siniscalco, M., and Smith, C.A.B. (1955). The estimation of gene frequencies in a random mating population. Annals of Human Genetics, 20:97-115.

Cohen, J.C., Kiss, R.S., Pertsemlidis, A., Marcel, Y.L., McPherson, R., and Hobbs, H.H. (2004). Multiple rare alleles contribute to low plasma levels of HDL cholesterol. Science, 305:869-872.

Crow, J. and Kimura, M. (1970). An Introduction to Population Genetics Theory. Harper and Row, New York.

Dempster, A.P., Laird, N.M., and Rubin, D.B. (1977). Maximum likelihood from incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society, Series B, 39:1-37.

Edwards, A.W.F. (1972). Likelihood. Cambridge University Press, Cambridge, UK.

Elston, R.C. and Stewart, J. (1971). A general model for the analysis of pedigree data. Human Heredity, 21:523-542.

Fisher, R.A. (1918). The correlation between relatives on the supposition of Mendelian inheritance. Transactions of the Royal Society of Edinburgh, 52:399-433.


Fisher, R.A. (1922a). On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society of London, Series A, 222:309-368.

Fisher, R.A. (1922b). The systematic location of genes by means of crossover observations. The American Naturalist, 56:406-411.

Fisher, R.A. (1925). Statistical Methods for Research Workers. Oliver & Boyd, Edinburgh, UK.

Fisher, R.A. (1934). The amount of information supplied by records of families as a function of the linkage in the population sampled. Annals of Eugenics, 6:66-70.

Gelfand, A.E. and Smith, A.F.M. (1990). Sampling based approaches to calculating marginal densities. Journal of the American Statistical Association, 46:193-227.

Geyer, C.J., Ryder, O.A., Chemnick, L.G., and Thompson, E.A. (1993). Analysis of relatedness in the California condors from DNA fingerprints. Molecular Biology and Evolution, 10:571-589.

Geyer, C.J. and Thompson, E.A. (1992). Constrained Monte Carlo maximum likelihood for dependent data (with discussion). Journal of the Royal Statistical Society, Series B, 54:657-699.

Haldane, J.B.S. (1934). Methods for the detection of autosomal linkage in man. Annals of Eugenics, 6:26-65.

Haldane, J.B.S. and Smith, C.A.B. (1947). A new estimate of the linkage between the genes for colour-blindness and haemophilia in man. Annals of Eugenics, 14:10-31.

Heath, S.C. (1997). Markov chain Monte Carlo segregation and linkage analysis for oligogenic models. American Journal of Human Genetics, 61:748-760.

International HapMap Consortium (2005). A haplotype map of the human genome. Nature, 237:1299-1319.

Jeffreys, A.J., Turner, M., and Debenham, P. (1991). The efficiency of multilocus DNA fingerprint probes for individualization and establishment of family relationships, determined from extensive casework. American Journal of Human Genetics, 48:824-840.

Karr, J.-B.A. (1849). Les guêpes. Michel Lévy Frères, Paris.

Lander, E.S. and Green, P. (1987). Construction of multilocus genetic linkage maps in humans. Proceedings of the National Academy of Sciences (USA), 84:2363-2367.


Lange, K. and Sobel, E. (1991). A random walk method for computing genetic location scores. American Journal of Human Genetics, 49:1320-1334.

Lauritzen, S.L. (1996). Graphical Models. Oxford University Press, Oxford, UK.

Lauritzen, S.L. and Spiegelhalter, D.J. (1988). Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society, Series B, 50:157-224.

Lee, S.H., Wray, N.R., Goddard, M.E., and Visscher, P.M. (2011). Estimating missing heritability for disease from genome-wide association studies. American Journal of Human Genetics, 88:294-305.

Lin, S., Thompson, E.A., and Wijsman, E.M. (1993). Achieving irreducibility of the Markov chain Monte Carlo method applied to pedigree data. IMA Journal of Mathematics Applied in Medicine and Biology, 10:1-17.

Madsen, B.E. and Browning, S.R. (2009). A groupwise association test for rare mutations using a weighted sum statistic. PLOS Genetics, 5:e1000384.

Mardis, E.R. (2008). Next-generation DNA sequencing methods. Annual Review of Genomics and Human Genetics, 9:387-402.

Morton, N.E. (1955). Sequential tests for the detection of linkage. American Journal of Human Genetics, 7:277-318.

Mourant, A.E. (1954). The Distribution of the Human Blood Groups. Blackwell, Oxford, UK.

Murray, J.C., Buetow, K.H., Weber, J.L., Ludwigsen, S., Heddema, S.T., Manion, F., Quillen, J., Sheffield, V.C., Sunden, S., Duyk, G.M. et al. (1994). A comprehensive human linkage map with centimorgan density. Science, 265:2049-2054.

Patterson, H.D. and Thompson, R. (1971). Recovery of inter-block information when blocks are unequal. Biometrika, 58:545-554.

Pearl, J. (2000). Causality: Models, Reasoning, and Inference. Cambridge University Press, Cambridge, UK.

Penttinen, A. (1984). Modelling interaction in spatial point patterns: Parameter estimation by the maximum likelihood method. Jyväskylä Studies in Computer Science, Economics, and Statistics, vol. 7.

Price, A.L., Patterson, N.J., Plenge, R.M., Weinblatt, M.E., Shadick, N.A., and Reich, D. (2006). Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics, 38:904-909.


Schena, M., Shalon, D., Davis, R.W., and Brown, P.O. (1995). Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science, 270:467-470.

Sheehan, N.A. and Thomas, A.W. (1993). On the irreducibility of a Markov chain defined on a space of genotype configurations by a sampling scheme. Biometrics, 49:163-175.

Shendure, J. (2008). The beginning of the end for microarrays? Nature Methods, 5:585-587.

Smith, C.A.B. (1953). Detection of linkage in human genetics. Journal of the Royal Statistical Society, Series B, 15:153-192.

Stern, C. (1973). Principles of Human Genetics, 3rd edition. W.H. Freeman, San Francisco, CA.

Storey, J.D. (2002). A direct approach to false discovery rates. Journal of the Royal Statistical Society, Series B, 64:479-498.

Storey, J.D. (2003). The positive false discovery rate: A Bayesian interpretation and the q-value. The Annals of Statistics, 31:2013-2035.

Thompson, E.A. (1975). Human Evolutionary Trees. Cambridge University Press, Cambridge, UK.

Thompson, E.A. (1978). Ancestral inference II: The founders of Tristan da Cunha. Annals of Human Genetics, 42:239-253.

Thompson, E.A. (1981). Pedigree analysis of Hodgkin's disease in a Newfoundland genealogy. Annals of Human Genetics, 45:279-292.

Thompson, E.A. (1983). Gene extinction and allelic origins in complex genealogies. Philosophical Transactions of the Royal Society of London (Series B), 219:241-251.

Thompson, E.A. (1994). Monte Carlo estimation of multilocus autozygosity probabilities. In Proceedings of the 1994 Interface Conference (J. Sall and A. Lehman, Eds.). Fairfax Station, VA, pp. 498-506.

Thompson, E.A. (2000). Statistical Inferences from Genetic Data on Pedigrees, Volume 6 of NSF-CBMS Regional Conference Series in Probability and Statistics. Institute of Mathematical Statistics, Beachwood, OH.

Thompson, E.A. (2003). Information from data on pedigree structures. In Science of Modeling: Proceedings of AIC 2003, pp. 307-316. Research Memorandum of the Institute of Statistical Mathematics, Tokyo, Japan.

Thompson, E.A. (2004). The importance of Why? The American Statistician, 58:198.


Thompson, E.A. (2011). The structure of genetic linkage data: From LIPED to 1M SNPs. Human Heredity, 71:88-98.

Thompson, E.A. and Heath, S.C. (1999). Estimation of conditional multilocus gene identity among relatives. In Statistics in Molecular Biology and Genetics: Selected Proceedings of a 1997 Joint AMS-IMS-SIAM Summer Conference on Statistics in Molecular Biology (F. Seillier-Moiseiwitsch, Ed.). IMS Lecture Note-Monograph Series Vol. 33, pp. 95-113. Institute of Mathematical Statistics, Hayward, CA.

Thompson, E.A., Kravitz, K., Hill, J., and Skolnick, M.H. (1978). Linkage and the power of a pedigree structure. In Genetic Epidemiology (N.E. Morton, Ed.). Academic Press, New York, pp. 247-253.

Thornton, T. and McPeek, M.S. (2007). Case-control association testing with related individuals: A more powerful quasi-likelihood score test. American Journal of Human Genetics, 81:321-337.

Thornton, T. and McPeek, M.S. (2010). ROADTRIPS: Case-control association testing with partially or completely unknown population and pedigree structure. American Journal of Human Genetics, 86:172-184.

Tibshirani, R.J., Saunders, M., Rosset, S., Zhu, J., and Knight, K. (2005). Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society, Series B, 67:91-108.

Tong, L. and Thompson, E.A. (2008). Multilocus lod scores in large pedigrees: Combination of exact and approximate calculations. Human Heredity, 65:142-153.

Wellcome Trust Case Control Consortium (2007). Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature, 447:661-678.

Wright, S. (1921). Correlation and causation. Journal of Agricultural Research, 20:557-585.

Wright, S. (1922). Coefficients of inbreeding and relationship. American Naturalist, 56:330-338.


40
Targeted learning: From MLE to TMLE

Mark van der Laan

Division of Biostatistics, School of Public Health, University of California, Berkeley, CA

In this chapter I describe some of the essential elements of my past scientific journey from the study of nonparametric maximum likelihood estimation (NPMLE) to the field of targeted learning and the resulting new general tool, targeted minimum loss based estimation (TMLE). In addition, I discuss our current and future research program involving the further development of targeted learning to deal with dependent data. This journey involved mastering difficult statistical concepts and ideas, and combining them into an evolving roadmap for targeted learning from data under realistic model assumptions. I hope to convey the message that this is a highly inspiring, evolving, unifying, and interdisciplinary project that needs input from many future generations to come, and one that promises to deal with the current and future challenges of statistical inference with respect to a well-defined, typically complex, targeted estimand based on extremely high-dimensional data structures per unit, complex dependencies between the units, and very large sample sizes.

40.1 Introduction

Statistical practice has been dominated by the application of statistical methods relying on parametric model assumptions such as linear, logistic, and Cox proportional hazards regression methodology. Most of these methods use maximum likelihood estimation, but others rely on estimating equations such as generalized estimating equations (GEE). These maximum likelihood estimators are known to be asymptotically Normally distributed, and asymptotically efficient, under weak regularity conditions, beyond the key condition that the true data generating distribution satisfies the restrictions assumed by these parametric models.

When I started my PhD in 1990, my advisor Richard Gill inspired me to work on maximum likelihood estimation and estimating equation methods
for nonparametric or semi-parametric statistical models, with a focus on models for censored data (van der Laan, 1996). Specifically, I worked on the construction of a semi-parametric efficient estimator of the bivariate survival function based on bivariate right-censored failure time data, generalizing the Kaplan-Meier estimator of a univariate survival function. At that time the book by Bickel et al. (1997) on semi-parametric models was about to appear and earlier versions were circulating. There was an enormous interest among the theoreticians, and I had the fortune to learn from various inspiring intellectual leaders such as Richard Gill, Aad van der Vaart, Sara van de Geer, Peter Bickel, Jon Wellner, Richard Dudley, David Pollard, James Robins, and many more.

In order to deal with the challenges of these semi-parametric models I had to learn about efficiency theory for semi-parametric models, relying on a so-called least favorable parametric submodel for which estimation of the desired finite dimensional estimand is as hard as it is in the actual infinite-dimensional semi-parametric model. I also had to compute projections in Hilbert spaces to calculate efficient influence curves and corresponding least favorable submodels. Richard Gill taught me how to represent an estimator as a functional applied to the empirical distribution of the data, and how to establish functional differentiability of these estimator-functionals. I was taught about the functional delta-method, which translates (a) the convergence in distribution of the plugged-in empirical process, and (b) the functional differentiability of the estimator, into the convergence in distribution of the standardized estimator to a Gaussian process; see, e.g., Gill et al. (1995). I learned how to compute the influence curve of a given estimator, and that it is the object that identifies the asymptotic Gaussian process of the standardized estimators. In addition, Aad van der Vaart taught me about weak convergence of empirical processes indexed by a class of functions, and Donsker classes defined by entropy integral conditions (van der Vaart and Wellner, 1996), while Richard Gill taught me about models for the intensity of counting processes and continuous time martingales (Andersen et al., 1993). Right after my PhD thesis, Jamie Robins taught me over the years a variety of clever methods for calculating efficient influence curves in complex statistical models for complex longitudinal data structures, and general estimating equation methodology for semi-parametric models, and I learned about causal inference for multiple time-point interventions (Robins and Rotnitzky, 1992; van der Laan and Robins, 2003).

At that time, I did not know about a large statistical community that would have a hard time accepting the formulation of the statistical estimation problem in terms of a true statistical semi-parametric model, and an estimand/target parameter as a functional from this statistical model to the parameter space, as the way forward, but that instead used quotes such as "All models are wrong, but some are useful" to justify the application of wrong parametric models for analyzing data. By going this route, this community not only accepts seriously biased methods for analyzing data in which
confidence intervals and p-values have no real meaning, but it also obstructs progress by not formulating the true statistical challenge that needs to be addressed to solve the actual estimation problem. In particular, due to this resistance, we still see that most graduate students in biostatistics and statistics programs do not know much about the above mentioned topics, such as efficiency theory, influence curves, and efficient influence curves. I would not have predicted at that time that I would be able to inspire new generations with the very topics I learned in the 1990s.

Throughout my career, my only goal has been to advance my understanding of how to formulate and address the actual estimation problems in a large variety of real world applications, often stumbling on the need for new theoretical and methodological advances. This has been my journey that started with my PhD thesis, and it is a product of being part of such a rich community of scientists and young dynamic researchers who care about truth and stand for progress. I try and hope to inspire next generations to walk such journeys, each person in their own individual manner, fully utilizing their individual talents and skills, since it is a path which gives much joy and growth, and thereby satisfaction.

In the following sections, I will try to describe the highlights of this scientific journey, resulting in a formulation of the field of targeted learning (van der Laan and Rose, 2012), and an evolving roadmap for targeted learning (Pearl, 2009; Petersen and van der Laan, 2012; van der Laan and Rose, 2012) dealing with past, current, and future challenges that require the input of many generations to come. To do this in a reasonably effective way, we start out by providing some succinct definitions of key statistical concepts such as statistical model, model, target quantity, statistical target parameter, and asymptotic linearity of estimators. Subsequently, we will delve into the construction of finite sample robust, asymptotically efficient substitution estimators in realistic semi-parametric models for experiments that generate complex high-dimensional data structures that are representative of the current flood of information generated by our society. These estimators of specified estimands utilize the state of the art in machine learning and data-adaptive estimation, while preserving statistical inference. We refer to the field that is concerned with the construction of such targeted estimators and corresponding statistical inference as targeted learning.

40.2 The statistical estimation problem

40.2.1 Statistical model

The statistical model encodes known restrictions on the probability distribution of the data, and thus represents a set of statistical assumptions.
Let's denote the observed random variable on a unit by O, and let P_0 be the probability distribution of O. In addition, let's assume that the observed data are a realization of n independent and identically distributed copies O_1, ..., O_n of O ~ P_0. Formally, a statistical model is the collection of possible probability distributions of O, and we denote this set by M.

Contrary to most current practice, a statistical model should contain the true P_0, so that the resulting estimation problem is a correct formulation, and not a biased approximation of the true estimation problem. The famous quote that all statistical models are wrong represents a false statement, since it is not hard to formulate truthful statistical models that only incorporate true knowledge, such as the nonparametric statistical model that makes no assumptions at all. Of course, we already made the key statistical assumption that the n random variables O_1, ..., O_n are independent and identically distributed, and that assumption itself might need to be weakened to a statistical model for (O_1, ..., O_n) ~ P_0^n that contains the true distribution P_0^n. For a historical and philosophical perspective on "models, inference, and truth," we refer to Starmans (2012).

40.2.2 The model encoding both statistical and non-testable assumptions

A statistical model could be represented as M = {P_θ : θ ∈ Θ} for some mapping θ ↦ P_θ defined on an infinite-dimensional parameter space Θ. We refer to this mapping θ ↦ P_θ as a model, and it implies the statistical model for the true data distribution P_0 = P_{θ_0}. There will always exist many models that are compatible with a particular statistical model. It is important to note that the statistical model is the only relevant information for the statistical estimation problem. Examples of models are censored data and causal inference models, in which case the observed data structure O = Φ(C, X) is represented as a many-to-one mapping Φ from the full data X and censoring variable C to O, in which case the observed data distribution is indexed by the full-data distribution P_X and censoring mechanism P_{C|X}. So in this case Θ represents the set of possible (P_X, P_{C|X}), and P_θ is the distribution of Φ(C, X) implied by the distribution θ of (C, X). Different models for (P_X, P_{C|X}) might imply the same statistical model for the data distribution of O. We note that a model encodes assumptions beyond the statistical model, and we refer to these additional assumptions as non-testable assumptions since they put no restrictions on the distribution of the data. Assumptions such as O = Φ(C, X), and the coarsening or missing at random assumption on the conditional distribution P_{C|X}, are examples of non-testable assumptions that do not affect the statistical model.


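To make the censored-data notation above concrete, here is a standard textbook example written out in LaTeX (my own illustration, not an example worked in the chapter): right-censored survival data, where identification of the survival function rests on the non-testable assumption of independent censoring. Take the full data to be a survival time, X = T, and the censoring variable C another positive random variable, so that the observed data structure is

\[
O = \Phi(C, X) = (\tilde T, \Delta), \qquad \tilde T = \min(T, C), \qquad \Delta = I(T \le C),
\]

with θ = (P_X, P_{C|X}) and, say, target quantity Ψ^F(θ)(t) = P_X(T > t). Under the non-testable assumption that C is independent of T (a special case of coarsening at random), the hazard of T is identified from the observed-data distribution,

\[
\Lambda(dt) = \frac{P(\tilde T \in dt,\ \Delta = 1)}{P(\tilde T \ge t)},
\qquad
\Psi^F(\theta)(t) = \prod_{(0,t]} \bigl\{1 - \Lambda(ds)\bigr\},
\]

which depends on θ only through P_θ, i.e., Ψ^F(θ) = Ψ(P_θ). The Kaplan-Meier estimator mentioned in the Introduction is precisely the plug-in (substitution) estimator obtained by evaluating this mapping Ψ at empirical quantities.
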
40.2.3 Target quantity of interest, and its identifiability from the observed data distribution

The importance of constructing a model is that it allows us to define interesting target quantities Ψ^F(θ), for a given mapping Ψ^F : Θ → R^d, that represent the scientific question of interest that we would like to learn from our data O_1, ..., O_n. Given such a definition of the target parameter Ψ^F : Θ → R^d, one would like to establish that there exists a mapping Ψ : M → R^d so that Ψ^F(θ) = Ψ(P_θ). If such a mapping Ψ exists, we state that the target quantity Ψ^F(θ) is identifiable from the observed data distribution. This is often only possible by making additional non-testable restrictions on θ, in the sense that one is only able to write Ψ^F(θ) = Ψ(P_θ) for θ ∈ Θ* ⊂ Θ.

40.2.4 Statistical target parameter/estimand, and the corresponding statistical estimation problem

This identifiability result defines a statistical target parameter Ψ : M → R^d. The goal is to estimate ψ_0 = Ψ(P_0) based on n i.i.d. observations on O ~ P_0 ∈ M. The estimand ψ_0 can be interpreted as the target quantity Ψ^F(θ_0) if both the non-testable and the statistical model assumptions hold. Nonetheless, because the statistical model contains the true data distribution P_0, ψ_0 always has a pure statistical interpretation as the feature Ψ(P_0) of the data distribution P_0. A related additional goal is to obtain a confidence interval for ψ_0. A sensitivity analysis can be used to provide statistical inference for the underlying target quantity ψ_0^F under a variety of violations of the assumptions that were needed to state that ψ_0^F = ψ_0, as we discuss below.

40.3 The curse of dimensionality for the MLE

40.3.1 Asymptotically linear estimators and influence curves

An estimator is a Euclidean valued mapping Ψ̂ defined on a statistical model that contains all empirical probability distributions. Therefore, one might represent an estimator as a mapping Ψ̂ : M_NP → R^d from the nonparametric statistical model M_NP into the parameter space. In order to allow for statistical inference, one is particularly interested in estimators that behave in first order as an empirical mean of i.i.d. random variables, so that the estimator is asymptotically Normally distributed. An estimator Ψ̂ is asymptotically linear at a data distribution P_0 with influence curve D_0 if

\[
\hat\Psi(P_n) - \psi_0 = \frac{1}{n}\sum_{i=1}^n D_0(O_i) + o_P(1/\sqrt{n}).
\]
Such an estimator is asymptotically Normally distributed in the sense that √n(ψ_n − ψ_0) converges in distribution to N(0, σ²), where σ² is the variance of D_0(O).

40.3.2 Efficient influence curve

Efficiency theory teaches us that an estimator is efficient if and only if it is asymptotically linear with influence curve D_0^* equal to the canonical gradient of the pathwise derivative of Ψ : M → R^d. This canonical gradient is also called the efficient influence curve. Due to this fundamental property, the efficient influence curve is one of the most important objects in statistics. There are general representation theorems and corresponding methods for the calculation of efficient influence curves based on Hilbert space projections; see Bickel et al. (1997) and van der Laan and Robins (2003). Indeed, the efficient influence curve forms a crucial ingredient in any methodology for the construction of efficient estimators.

40.3.3 Substitution estimators

Efficient estimators fully utilize the local constraints at P_0 in the model, but they provide no guarantee of full utilization of the fact that P_0 ∈ M and that ψ_0 = Ψ(P_0) for some P_0 ∈ M. The latter global information of the estimation problem is captured by making sure that the estimator is a substitution estimator that can be represented as Ψ̂(P_n) = Ψ(P_n^*) for some P_n^* ∈ M. Alternatively, if Ψ(P) only depends on P through an (infinite-dimensional) parameter Q(P), and we denote the target parameter by Ψ(Q) again, then a substitution estimator can be represented as Ψ(Q_n) for a Q_n in the parameter space {Q(P) : P ∈ M}. Efficient estimators based on estimating equation methodology (van der Laan and Robins, 2003) provide no guarantee of obtaining substitution estimators, and are thereby not as finite sample robust as the efficient substitution estimators we discuss here.


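A minimal sketch may help fix the idea of a substitution estimator. For the parameter Ψ(P) = E_P{E_P(Y | A = 1, W)}, which reappears in the next subsection and in Section 40.5, a substitution estimator plugs an estimate of the outcome regression and the empirical distribution of W into the parameter mapping. The code below is an illustration on simulated data; the linear regression fit is a deliberately simple placeholder for what, in the spirit of this chapter, would be a far more adaptive estimator.

```python
import numpy as np

rng = np.random.default_rng(3)
expit = lambda x: 1.0 / (1.0 + np.exp(-x))

# Simulated observations O_i = (W_i, A_i, Y_i); all model choices are illustrative.
n = 1000
W = rng.normal(size=n)
A = rng.binomial(1, expit(0.5 * W))
Y = 1.0 + 2.0 * A + W + rng.normal(size=n)

# Q_n: a simple linear regression of Y on (A, W) stands in for the
# (ideally much more adaptive) estimator of Qbar_0(a, w) = E_0(Y | A = a, W = w).
X = np.column_stack([np.ones(n), A, W])
coef, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Substitution estimator: plug Qbar_n and the empirical distribution of W
# into the parameter mapping  Psi(Q) = E_W Qbar(1, W).
X1 = np.column_stack([np.ones(n), np.ones(n), W])   # evaluate at A = 1
psi_n = (X1 @ coef).mean()
print(psi_n)   # close to the true value 3.0 under this simulation
```
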
40.3.4 MLE and curse of dimensionality

Maximum likelihood estimators are examples of substitution estimators. Unfortunately, due to the curse of dimensionality of infinite-dimensional statistical models, the MLE is often ill defined. For example, consider the MLE of the bivariate failure time distribution based on bivariate right-censored failure time data. In this case, an MLE can be implicitly defined (through the so-called self-consistency equation) as an estimator that assigns mass 1/n to each observation and redistributes this mass over the coarsening for the bivariate failure time implied by this observation according to the MLE itself. In this case, the coarsenings are singletons for the doubly uncensored observations (i.e., both failure times are observed), half-lines for the singly censored observations (i.e., one failure time is censored), and quadrants for the doubly censored observations. For continuously distributed failure times, all the half-lines contain zero uncensored bivariate failure times, and as a consequence the likelihood provides essentially no information about how these masses 1/n should be distributed over the half-lines. That is, the MLE is highly non-unique, and thereby also inconsistent.

Similarly, consider the estimation of Ψ(P_0) = E_0 E_0(Y | A = 1, W) based on n i.i.d. observations on (W, A, Y) ~ P_0, and suppose the statistical model is the nonparametric model. The MLE of E_0(Y | A = 1, W = w) is the empirical mean of the outcome among the observations with A_i = 1, W_i = w, while the MLE of the distribution of W is the empirical distribution of W_1, ..., W_n. This estimator is ill-defined since most strata will have no observations.

40.3.5 Regularizing MLE through smoothing

In order to salvage the MLE, the literature suggests regularizing the MLE in some manner. This often involves either smoothing or sieve-based MLE, where the fine-tuning parameters need to be selected based on some empirical criterion. For example, in our bivariate survival function example, we could put strips around the half-lines of the singly censored observations, and compute the MLE as if the half-lines implied by the singly censored observations are now these strips. Under this additional level of coarsening, the MLE is now uniquely defined as long as the strips contain at least one uncensored observation. In addition, if one makes sure that the number of observations in the strips converges to infinity as the sample size increases, and the width of the strips converges to zero, then the MLE will also be consistent. Unfortunately, there is still a bias/variance trade-off that needs to be resolved in order to arrange that the MLE of the bivariate survival function is asymptotically linear. Specifically, we need to make sure that the width of the strips converges fast enough to zero so that the bias of the MLE with respect to the conditional densities over the half-lines is o(1/√n). This would mean that the width of the strips is o(1/√n). For an extensive discussion of this estimation problem, and an alternative smoothing approach to repair the NPMLE, we refer to van der Laan (1996).

Similarly, we could estimate the regression function E_0(Y | A = 1, W) with a histogram regression method. If the dimension of W is k, then for the sake of arranging that each bin contains at least one observation, one needs to select a very large width so that the k-dimensional cube with width h contains one observation with high probability. That is, we will need to select h so that nh^k → ∞. This binning causes a bias of order O(h) for the MLE of E E(Y | A = 1, W). As a consequence, we will need that n^{-1/k} converges to zero faster than n^{-1/2}, which only holds when k = 1. In other words, there is no value of the smoothing parameter that results in a regularized MLE that is asymptotically linear.

Even though there is no histogram-regularization possible, there might exist other ways of regularizing the MLE. The statistics and machine learning
literature provides many possible approaches to construct estimators of the required objects Q_0, and thereby of ψ_0 = Ψ(Q_0). One strategy would be to define a large class of submodels that contains a sequence of submodels that approximates the complete statistical model (a so-called sieve), and to construct for each submodel an estimator that achieves the minimax rate under the assumption that Q_0 is an element of this submodel. One can now use a data-adaptive selector to select among all these submodel-specific candidate estimators. This general strategy, which is often referred to as sieve-based MLE, results in a minimax adaptive estimator of Q_0; i.e., the estimator converges at the minimax rate of the smallest submodel (measured by entropy) that still contains the true Q_0. Such an estimator is called minimax adaptive. We refer to van der Laan and Dudoit (2003) and van der Laan et al. (2006) for such general minimum loss-based estimators relying on cross-validation to select the subspace. This same strategy can be employed with kernel regression estimators that are indexed by the degree of orthogonality of the kernel and a bandwidth, and one can use a data-adaptive selector to select this kernel and bandwidth. In this manner the resulting data-adaptive kernel regression estimator will achieve the minimax rate of convergence corresponding with the unknown underlying smoothness of the true regression function.

40.3.6 Cross-validation

Cross-validation is a particularly powerful tool to select among candidate estimators. In this case, one defines a criterion that measures the performance of a given fit of Q_0 on a particular sub-sample: typically, this is defined as an empirical mean of a loss function that maps the fit Q and an observation O_i into a real number, and is such that the minimizer of the expectation of the loss over all Q equals the desired true Q_0. For each candidate estimator, one trains the estimator on a training sample and one evaluates the resulting fit on the complement of the training sample, which is called the validation sample. This is carried out for a number of sample splits into training and validation samples, and one selects the estimator that has the best average performance across the sample splits. Statistical theory teaches us that this procedure is asymptotically optimal in great generality, in the sense that it performs asymptotically as well as an oracle selector that selects the estimator based on the criterion applied to an infinite validation sample; see, e.g., Györfi et al. (2002), van der Laan and Dudoit (2003), van der Laan et al. (2006), and van der Vaart et al. (2006). The key conditions are that the loss function needs to be uniformly bounded, and the size of the validation sample needs to converge to infinity.


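The cross-validation selector just described, and the super learner of the next section, can be sketched in a few lines. The code below is a generic illustration with simulated data and a small invented library of regression estimators; it implements the discrete selector (pick the candidate with the smallest cross-validated squared-error risk), not the full convex-combination super learner of van der Laan et al. (2007).

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated regression data with a one-dimensional W; all choices are illustrative.
n = 500
W = rng.uniform(-2, 2, size=n)
Y = np.sin(2 * W) + 0.5 * W + rng.normal(scale=0.5, size=n)

# A small library of candidate estimators of Qbar_0(w) = E(Y | W = w).
# Each takes training data and returns a prediction function.
def linear_fit(Wtr, Ytr):
    b, a = np.polyfit(Wtr, Ytr, 1)          # slope, intercept
    return lambda w: a + b * w

def poly_fit(Wtr, Ytr, degree=5):
    coefs = np.polyfit(Wtr, Ytr, degree)
    return lambda w: np.polyval(coefs, w)

def knn_fit(Wtr, Ytr, k=20):
    def predict(w):
        w = np.atleast_1d(w)
        idx = np.argsort(np.abs(Wtr[None, :] - w[:, None]), axis=1)[:, :k]
        return Ytr[idx].mean(axis=1)
    return predict

library = {"linear": linear_fit, "poly5": poly_fit, "knn20": knn_fit}

# V-fold cross-validated risk (squared-error loss) for each candidate.
V = 10
folds = np.array_split(rng.permutation(n), V)
cv_risk = {name: 0.0 for name in library}
for val_idx in folds:
    tr_idx = np.setdiff1d(np.arange(n), val_idx)
    for name, fit in library.items():
        pred = fit(W[tr_idx], Y[tr_idx])
        cv_risk[name] += np.mean((Y[val_idx] - pred(W[val_idx])) ** 2) / V

best = min(cv_risk, key=cv_risk.get)
print(cv_risk, "-> selected:", best)
```
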
40.4 Super learning

These oracle results for the cross-validation selector teach us that it is possible to construct an ensemble estimator that asymptotically outperforms any user-supplied library of candidate estimators. We called this estimator the super learner due to its theoretical properties: it is defined as a combination of all the candidate estimators, where the weights defining the combination (e.g., a convex combination) are selected based on cross-validation; see, e.g., van der Laan et al. (2007) and Polley et al. (2012). By using the super learner as a way to regularize the MLE, we obtain an estimator with a potentially much better rate of convergence to the true Q_0 than a simple regularization procedure such as the one based on binning discussed above. The bias of this super learner will converge to zero at the same rate as the rate of convergence of the super learner. The bias of the plug-in estimator of ψ_0 based on this super learner will also converge at this rate. Unfortunately, if none of our candidate estimators in the library achieves the rate 1/√n (e.g., an MLE according to a correctly specified parametric model), then this bias will be larger than 1/√n, so that this plug-in estimator will not converge at the desired √n rate. To conclude, although the super learner has superior performance in the estimation of Q_0, it still results in an overly biased estimator of the target Ψ(Q_0).

40.4.1 Under-smoothing fails as a general method

Our binning discussion above argues that, for typical definitions of adaptive estimators indexed by a fine-tuning parameter (e.g., bandwidth, number of basis functions), there is no value of the fine-tuning parameter that would result in a bias for ψ_0 of the order 1/√n. This is due to the fact that the fine-tuning parameter needs to exceed a certain value in order to define an estimator in the parameter space of Q_0. So even if we had selected the estimator in our library of candidate estimators that minimizes the MSE with respect to ψ_0 (instead of the one minimizing the cross-validated risk), we would still have selected an estimator that is overly biased for ψ_0.

The problem is that our candidate estimators rely on a fine-tuning parameter that controls the overall bias of the estimator. Instead we need candidate estimators that have an excellent overall fit of Q_0 but also rely on a tuning parameter that only controls the bias of the resulting plug-in estimator for ψ_0, and we need a way to fit this tuning parameter. For that purpose, we need to determine a submodel of fluctuations {Q_n(ε) : ε} through a candidate estimator Q_n at ε = 0, indexed by an amount of fluctuation ε, where fitting ε is locally equivalent with fitting ψ_0 in the actual semi-parametric statistical model M. It appears that the least favorable submodel from efficiency theory can be utilized for this purpose, while ε can be fitted with the parametric MLE. This insight resulted in so-called targeted maximum likelihood
estimators (van der Laan and Rubin, 2006). In this manner, we can map any candidate estimator into a targeted estimator, and we can use the super learner based on the library of candidate targeted estimators. Alternatively, one computes this targeted fit of a super learner based on a library of non-targeted candidate estimators.

40.5 Targeted learning

The real message is that one needs to make the learning process targeted towards its goal. The goal is to construct a good estimator of Ψ(Q_0), and that is not the same goal as constructing a good estimator of the much more ambitious infinite-dimensional object Q_0. For example, estimators of Ψ(Q_0) will have a variance that behaves as 1/n, while a consistent estimator of Q_0 at a point will generally only use local data, so that its variance converges at a significantly slower rate than 1/n. The bias of an estimator of Q_0 is a function, while the bias of an estimator of ψ_0 is just a finite dimensional vector of real numbers. For parametric maximum likelihood estimators one fits the unknown parameters by solving the score equations. An MLE in a semi-parametric model would aim to solve all (infinitely many) score equations, but due to the curse of dimensionality such an MLE simply does not exist for finite samples. However, if we know which score equation really matters for fitting ψ_0, then we can make sure that our estimator will solve that ψ_0-specific score equation. The efficient influence curve of the target parameter mapping Ψ : M → R^d represents this score.

40.5.1 Targeted minimum loss based estimation (TMLE)

The above mentioned insights evolved into the following explicit procedure called targeted minimum loss based estimation (TMLE); see, e.g., van der Laan and Rubin (2006), van der Laan (2008), and van der Laan and Rose (2012). First, one constructs an initial estimator of Q_0, such as a loss-based super learner based on a library of candidate estimators of Q_0. One then defines a loss function L(Q) so that Q_0 = arg min_Q P_0 L(Q), and a least favorable submodel {Q(ε) : ε} ⊂ M so that the generalized score (d/dε) L{Q(ε)}|_{ε=0} equals or spans the efficient influence curve D*(Q, g). Here we use the notation P_0 f = ∫ f(o) dP_0(o). This least favorable submodel might depend on an unknown nuisance parameter g = g(P). One is now ready to target the fit Q_n in such a way that its targeted version solves the efficient score equation P_n D*(Q_n^*, g_0) = 0. That is, one defines ε_n = arg min_ε P_n L{Q_n(ε)}, and the resulting update Q_n^1 = Q_n(ε_n). This updating process can be iterated until convergence, at which point ε_n = 0, so that the final update Q_n^* solves the score equation at ε_n = 0, and thus P_n D*(Q_n^*, g_0) = 0.
The efficient influence curve has the general property that

\[
P_0 D^*(Q, g) = \Psi(Q_0) - \Psi(Q) + R(Q, Q_0, g, g_0)
\]

for a second order term R(Q, Q_0, g, g_0). In fact, in many applications, we have that R(Q, Q_0, g, g_0) equals an integral of (Q − Q_0)(g − g_0), so that it equals zero if either Q = Q_0 or g = g_0, which is often referred to as the double robustness of the efficient influence curve. In that case, P_n D*(Q_n^*, g_0) = 0 implies that Ψ(Q_n^*) is a consistent estimator of ψ_0. In essence, the norm of P_n D*(Q_n^*, g_0) represents a criterion measuring a distance between Ψ(Q_n^*) and ψ_0, so that minimizing the Euclidean norm of P_n D*(Q_n^*, g_n) corresponds with fitting ψ_0. Since in many applications the nuisance parameter g_0 is unknown, one will have to replace g_0 in the updating procedure by an estimator g_n. In that case, we have

\[
P_0 D^*(Q_n^*, g_n) = \psi_0 - \Psi(Q_n^*) + R(Q_n^*, Q_0, g_n, g_0),
\]

where the remainder is still a second order term, but now also involving cross-term differences (Q_n^* − Q_0)(g_n − g_0).

40.5.2 Asymptotic linearity of TMLE

If this second order remainder term R(Q_n^*, Q_0, g_n, g_0) converges to zero in probability at a rate faster than 1/√n, then it follows that

\[
\psi_n^* - \psi_0 = (P_n - P_0) D^*(Q_n^*, g_n) + o_P(1/\sqrt{n}),
\]

so that, if P_0{D*(Q_n^*, g_n) − D*(Q_0, g_0)}² → 0 in probability, and the random function D*(Q_n^*, g_n) of O falls in a P_0-Donsker class, it follows that

\[
\psi_n^* - \psi_0 = (P_n - P_0) D^*(Q_0, g_0) + o_P(1/\sqrt{n}).
\]

That is, √n(ψ_n^* − ψ_0) is asymptotically Normally distributed with mean zero and variance equal to the variance of the efficient influence curve. Thus, if Q_n^* and g_n are consistent at fast enough rates, then ψ_n^* is asymptotically efficient. Statistical inference can now be based on the Normal limit distribution and an estimator of its asymptotic variance, such as σ_n² = P_n {D*(Q_n^*, g_n)}². This demonstrates that the utilization of the state of the art in adaptive estimation was not a hurdle for statistical inference; on the contrary, it is required to establish the desired asymptotic Normality of the TMLE. Establishing asymptotic linearity of the TMLE under misspecification of Q_0 (in the context that the efficient influence curve is double robust), while still allowing the utilization of very adaptive estimators of g_0, has to deal with additional challenges, resolved by also targeting the fit of g; see, e.g., van der Laan (2012) and van der Laan and Rose (2012).


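To fix ideas, here is a minimal numerical sketch of a TMLE for the estimand ψ_0 = E_0 E_0(Y | A = 1, W) used earlier in the chapter, with a binary outcome, the standard logistic fluctuation submodel, and the "clever covariate" H(A, W) = A / g(W). It is a bare-bones illustration on simulated data with deliberately simple (and misspecifiable) initial estimators, not the software or the general template of van der Laan and Rose (2012); in particular, the initial fit would normally be a super learner, and the single-step update shown here is not iterated.

```python
import numpy as np

rng = np.random.default_rng(2)
expit = lambda x: 1.0 / (1.0 + np.exp(-x))
logit = lambda p: np.log(p / (1.0 - p))

# Simulated data (W, A, Y) with binary Y; all model choices are illustrative.
n = 2000
W = rng.normal(size=n)
A = rng.binomial(1, expit(0.4 * W))                 # treatment given W
Y = rng.binomial(1, expit(-0.5 + A + 0.8 * W))      # outcome given (A, W)

def logistic_fit(X, y, offset=None, iters=25):
    """Logistic regression by Newton-Raphson, allowing a fixed offset."""
    offset = np.zeros(len(y)) if offset is None else offset
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = expit(offset + X @ beta)
        grad = X.T @ (y - p)
        hess = (X * (p * (1 - p))[:, None]).T @ X
        beta += np.linalg.solve(hess, grad)
    return beta

# Step 1: initial estimators (crude on purpose; normally a super learner).
Xq = np.column_stack([np.ones(n), A, W])
beta_q = logistic_fit(Xq, Y)
Qbar_A = expit(Xq @ beta_q)                                             # Qbar_n(A_i, W_i)
Qbar_1 = expit(np.column_stack([np.ones(n), np.ones(n), W]) @ beta_q)   # Qbar_n(1, W_i)

Xg = np.column_stack([np.ones(n), W])
g_n = expit(Xg @ logistic_fit(Xg, A))                                   # g_n(W_i) = P_n(A = 1 | W_i)

# Step 2: targeting step.  Least favorable logistic submodel through Qbar_n with
# clever covariate H(A, W) = A / g_n(W); epsilon fitted by MLE (offset regression).
H_A = A / g_n
H_1 = 1.0 / g_n
eps = logistic_fit(H_A[:, None], Y, offset=logit(Qbar_A))[0]
Qstar_A = expit(logit(Qbar_A) + eps * H_A)
Qstar_1 = expit(logit(Qbar_1) + eps * H_1)

# Step 3: substitution estimator and influence-curve-based standard error.
psi_tmle = Qstar_1.mean()
D_star = H_A * (Y - Qstar_A) + Qstar_1 - psi_tmle      # estimated efficient influence curve
se = np.sqrt(np.var(D_star) / n)
print(f"TMLE: {psi_tmle:.3f}  (95% CI {psi_tmle - 1.96*se:.3f}, {psi_tmle + 1.96*se:.3f})")
```

Cross-validated initial fits and iteration of the update are omitted here; they matter in practice, but not for the structure of the computation: initial fit, fluctuation along the least favorable submodel, and then the plug-in evaluation of the target parameter.
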
40.6 Some special topics

40.6.1 Sensitivity analysis

The TMLE methodology provides us with statistical inference for the estimand ψ_0. One typically wants to report findings about the actual target quantity of interest ψ_0^F, but it might not be reasonable to assume that ψ_0 = ψ_0^F. One simple way forward we recently proposed is to define the bias ψ_0 − ψ_0^F, and, for each assumed value δ of this bias, one can then estimate ψ_0^F with ψ_n − δ and report a corresponding confidence interval or p-value for the test of H_0 : ψ_0^F = 0. Subject matter knowledge combined with data analysis and/or simulations might now provide a reasonable upper bound for δ, and one can then determine whether such an upper bound would still provide significant results for the target quantity of interest. This sensitivity analysis can be made more conservative, in exchange for an enhanced interpretation of the sensitivity parameter δ, by defining δ as a particular upper bound on the causal bias ψ_0 − ψ_0^F. Such an upper bound might be easier to interpret and thereby improve the sensitivity analysis. We refer to Diaz and van der Laan (2012) for an introduction to this type of sensitivity analysis, a practical demonstration with a few data examples, and a preceding literature using alternative approaches; see, e.g., Rotnitzky et al. (2001), Robins et al. (1999), and Scharfstein et al. (1999).

40.6.2 Sample size 1 problems

Above we demonstrated that statistical inference relies on establishing asymptotic linearity, and thereby asymptotic Normality, of the standardized estimator of ψ_0. The asymptotic linearity relied heavily on the central limit theorem and uniform probability bounds for sums of independent variables (e.g., Donsker classes). In many applications, the experiment resulting in the observed data cannot be viewed as a series of independent experiments. For example, observing a community of individuals over time might truly be a single experiment, since the individuals might be causally connected through a network. In this case, the sample size is one. Nonetheless, one might know for each individual which other individuals it depends on, or one might know that the data at time t only depend on the past through the data collected over the last x months. Such assumptions imply conditional independence restrictions on the likelihood of the data. As another example, in a group sequential clinical trial one might make the randomization probabilities for the next group of subjects a function of the observed data on all the previously recruited individuals. The general field of adaptive designs concerns the construction of a single experiment that involves data-adaptive changes in the design in response to previously observed data, and the key challenge of such designs is to develop methods that provide honest statistical inference; see, e.g., Rosenblum and van der Laan (2011). These examples demonstrate that targeted learning
These examples demonstrate that targeted learning should also be concerned with data generated by single experiments that have a lot of structure. This requires the development of TMLE for such statistical models, integration of the state of the art in weak convergence theory for dependent data, and advances in computing due to the additional complexity of estimators and statistical inference in such data generating experiments. We refer to a few recent examples of targeted learning in adaptive designs, and to estimate effects of interventions on a single network of individuals; see van der Laan (2008), Chambaz and van der Laan (2010, 2011a,b), van der Laan (2012), and van der Laan et al. (2012).

40.6.3 Big Data

Targeted learning involves super-learning, complex targeted update steps, evaluation of an often complex estimand, and estimation of the asymptotic variance of the estimator. In addition, since the estimation is tailored to each question separately, the assessment of the effect of a variable (such as the effect of a DNA-mutation on a phenotype) across a large collection of variables, for example, requires repeating these computer intensive estimation procedures many times. Even for normal size data sets, such data analyses can already be computationally very challenging.

However, nowadays many applications contain gigantic data sets. For example, one might collect complete genomic profiles on each individual, so that one collects hundreds of thousands or even millions of measurements on one individual, possibly at various time points. In addition, there are various initiatives in building large comprehensive data bases, such as the Sentinel project, which builds a data base for all American citizens that is used to evaluate safety issues for drugs. Such data sets cover hundreds of millions of individuals. Many companies are involved in analyzing data on the internet, which can result in data sets with billions of records.

40.7 Concluding remarks

The biggest mistake we can make is to give up on sound statistics, and be satisfied with the application of algorithms that can handle these data sets in one way or another, without addressing a well defined statistical estimation problem. As we have seen, the genomic era has resulted in an erosion of sound statistics, and as a counterforce many advocate to only apply very simple statistics such as sample means and univariate regressions. Neither approach is satisfying, and fortunately, it is not needed to give up on sound and complex statistical estimation procedures targeting interesting questions of interest.


Instead, we need to more fully integrate with computer science, and train our students in software that can handle these immense computational and memory challenges, so that our methods can be implemented and made accessible to the actual users; but simultaneously we need to stick to our identity as statisticians as part of collaborative, highly interdisciplinary teams, pushing forward the development of optimal statistical procedures and corresponding statistical inference to answer the questions of interest. The statistician is fulfilling an absolutely crucial role in the design of the experiments, the statistical and causal formulation of the question of interest, the estimation procedure and thereby the design of the software, the development of valid statistical tools for statistical inference, and benchmarking these statistical methods with respect to statistical performance (van der Laan and Rose, 2010).

References

Andersen, P., Borgan, O., Gill, R.D., and Keiding, N. (1993). Statistical Models Based on Counting Processes. Springer, New York.
Bickel, P.J., Klaassen, C.A.J., Ritov, Y., and Wellner, J.A. (1997). Efficient and Adaptive Estimation for Semiparametric Models. Springer, New York.
Chambaz, A. and van der Laan, M. (2010). Targeting the Optimal Design in Randomized Clinical Trials with Binary Outcomes and no Covariate. Technical Report 258, Division of Biostatistics, University of California, Berkeley, CA.
Chambaz, A. and van der Laan, M. (2011a). Targeting the optimal design in randomized clinical trials with binary outcomes and no covariate, simulation study. International Journal of Biostatistics, 7:1–30.
Chambaz, A. and van der Laan, M. (2011b). Targeting the optimal design in randomized clinical trials with binary outcomes and no covariate, theoretical study. International Journal of Biostatistics, 7:1–32.
Diaz, I. and van der Laan, M. (2012). Sensitivity Analysis for Causal Inference Under Unmeasured Confounding and Measurement Error Problem. Technical Report 303, Division of Biostatistics, University of California, Berkeley, CA.
Gill, R.D., van der Laan, M., and Wellner, J.A. (1995). Inefficient estimators of the bivariate survival function for three models. Annales de l'Institut Henri Poincaré: Probabilités et Statistiques, 31:545–597.
Györfi, L., Kohler, M., Krzyżak, A., and Walk, H. (2002). A Distribution-free Theory of Nonparametric Regression. Springer, New York.
Pearl, J. (2009). Causality: Models, Reasoning, and Inference, 2nd edition. Cambridge University Press, New York.


Petersen, M. and van der Laan, M. (2012). A General Roadmap for the Estimation of Causal Effects. Division of Biostatistics, University of California, Berkeley, CA.
Polley, E., Rose, S., and van der Laan, M. (2012). Super learning. In Targeted Learning: Causal Inference for Observational and Experimental Data (M. van der Laan and S. Rose, Eds.). Springer, New York.
Robins, J. and Rotnitzky, A. (1992). Recovery of information and adjustment for dependent censoring using surrogate markers. In AIDS Epidemiology. Birkhäuser, Basel.
Robins, J.M., Rotnitzky, A., and Scharfstein, D.O. (1999). Sensitivity analysis for selection bias and unmeasured confounding in missing data and causal inference models. In Statistical Models in Epidemiology, the Environment and Clinical Trials. Springer, New York.
Rosenblum, M. and van der Laan, M. (2011). Optimizing randomized trial designs to distinguish which subpopulations benefit from treatment. Biometrika, 98:845–860.
Rotnitzky, A., Scharfstein, D.O., Su, T.-L., and Robins, J.M. (2001). Methods for conducting sensitivity analysis of trials with potentially nonignorable competing causes of censoring. Biometrics, 57:103–113.
Scharfstein, D.O., Rotnitzky, A., and Robins, J.M. (1999). Adjusting for nonignorable drop-out using semiparametric non-response models (with discussion). Journal of the American Statistical Association, 94:1096–1146.
Starmans, R. (2012). Model, inference, and truth. In Targeted Learning: Causal Inference for Observational and Experimental Data (M. van der Laan and S. Rose, Eds.). Springer, New York.
van der Laan, M. (1996). Efficient and Inefficient Estimation in Semiparametric Models. Centrum voor Wiskunde en Informatica Tract 114, Amsterdam, The Netherlands.
van der Laan, M. (2008). The Construction and Analysis of Adaptive Group Sequential Designs. Technical Report 232, Division of Biostatistics, University of California, Berkeley, CA.
van der Laan, M. (2012). Statistical Inference when using Data Adaptive Estimators of Nuisance Parameters. Technical Report 302, Division of Biostatistics, University of California, Berkeley, CA.
van der Laan, M., Balzer, L., and Petersen, M. (2012). Adaptive matching in randomized trials and observational studies. Journal of Statistical Research, 46:113–156.


van der Laan, M. and Dudoit, S. (2003). Unified Cross-validation Methodology for Selection Among Estimators and a General Cross-validated Adaptive Epsilon-net Estimator: Finite Sample Oracle Inequalities and Examples. Technical Report 130, Division of Biostatistics, University of California, Berkeley, CA.
van der Laan, M., Dudoit, S., and van der Vaart, A. (2006). The cross-validated adaptive epsilon-net estimator. Statistics & Decisions, 24:373–395.
van der Laan, M., Polley, E., and Hubbard, A.E. (2007). Super learner. Statistical Applications in Genetics and Molecular Biology, 6:Article 25.
van der Laan, M. and Robins, J.M. (2003). Unified Methods for Censored Longitudinal Data and Causality. Springer, New York.
van der Laan, M. and Rose, S. (2010). Statistics ready for a revolution: Next generation of statisticians must build tools for massive data sets. Amstat News, 399:38–39.
van der Laan, M. and Rose, S. (2012). Targeted Learning: Causal Inference for Observational and Experimental Data. Springer, New York.
van der Laan, M. and Rubin, D.B. (2006). Targeted maximum likelihood learning. International Journal of Biostatistics, 2:Article 11.
van der Vaart, A., Dudoit, S., and van der Laan, M. (2006). Oracle inequalities for multi-fold cross-validation. Statistics & Decisions, 24:351–371.
van der Vaart, A. and Wellner, J.A. (1996). Weak Convergence and Empirical Processes. Springer, New York.


41
Statistical model building, machine learning, and the ah-ha moment

Grace Wahba
Statistics, Biostatistics and Medical Informatics, and Computer Sciences, University of Wisconsin, Madison, WI

Highly selected "ah-ha" moments from the beginning to the present of my research career are recalled; these are moments when the main idea just popped up instantaneously, sparking sequences of future research activity, and almost all of them crucially involved discussions/interactions with others. Along with a description of these moments we give unsought advice to young statisticians. We conclude with remarks on issues relating to statistical model building/machine learning in the context of human subjects data.

41.1 Introduction: Manny Parzen and RKHS

Many of the "ah-ha" moments below involve Reproducing Kernel Hilbert Spaces (RKHS), so we begin there. My introduction to RKHS came while attending a class given by Manny Parzen on the lawn in front of the old Sequoia Hall at Stanford around 1963; see Parzen (1962). For many years RKHS (Aronszajn, 1950; Wahba, 1990) were a little niche corner of research which suddenly became popular when their relation to Support Vector Machines (SVMs) became clear; more on that later. To understand most of the ah-ha moments it may help to know a few facts about RKHS, which we now give.

An RKHS is a Hilbert space H where all of the evaluation functionals are bounded linear functionals. What this means is the following: let the domain of H be T, and the inner product ⟨·, ·⟩. Then, for each t ∈ T there exists an element, call it K_t, in H, with the property f(t) = ⟨f, K_t⟩ for all f in H. K_t is known as the representer of evaluation at t. Let K(s, t) = ⟨K_s, K_t⟩; this is clearly a positive definite function on T ⊗ T.
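A small numerical illustration of these definitions: positive definiteness of K(s, t) shows up as a positive semidefinite Gram matrix at any finite set of points. The Gaussian kernel below is only an example choice of positive definite function; nothing in the chapter singles it out.

```python
import numpy as np

def gram_matrix(kernel, points):
    """Evaluate K(s, t) at all pairs of points; this is the n x n Gram matrix
    whose (i, j) entry is the inner product of the representers K_{t_i}, K_{t_j}."""
    pts = np.asarray(points, dtype=float)
    return np.array([[kernel(s, t) for t in pts] for s in pts])

# Example positive definite function on T = R (Gaussian kernel).
gaussian = lambda s, t: np.exp(-0.5 * (s - t) ** 2)

t = np.linspace(0.0, 1.0, 25)
K = gram_matrix(gaussian, t)
# Positive definiteness: all eigenvalues of the Gram matrix are nonnegative
# (up to numerical round-off).
print(np.linalg.eigvalsh(K).min())
```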


By the Moore–Aronszajn theorem, every RKHS is associated with a unique positive definite function, as we have just seen. Conversely, given a positive definite function, there exists a unique RKHS (which can be constructed from linear combinations of the K_t, t ∈ T, and their limits). Given K(s, t) we denote the associated RKHS as H_K. Observe that nothing has been assumed concerning the domain T. A second role of positive definite functions is as the covariance of a zero mean Gaussian stochastic process on T. In a third role that we will come across later, let O_1,...,O_n be a set of n abstract objects. An n × n positive definite matrix can be used to assign pairwise squared Euclidean distances d_ij between O_i and O_j by d_ij = K(i, i) + K(j, j) − 2K(i, j). In Sections 41.1.1–41.1.9 we go through some ah-ha moments involving RKHS, positive definite functions and pairwise distances/dissimilarities. Section 41.2 discusses sparse models and the lasso. Section 41.3 has some remarks involving complex interacting attributes, the "Nature-Nurture" debate, Personalized Medicine, human subjects privacy and scientific literacy, and we end with conclusions in Section 41.4.

I end this section by noting that Manny Parzen was my thesis advisor, and Ingram Olkin was on my committee. My main advice to young statisticians is: choose your advisor and committee carefully, and be as lucky as I was.

41.1.1 George Kimeldorf and the representer theorem

Back around 1970 George Kimeldorf and I both got to spend a lot of time at the Math Research Center at the University of Wisconsin–Madison (the one that later got blown up as part of the anti-Vietnam-war movement). At that time it was a hothouse of spline work, headed by Iso Schoenberg, Carl de Boor, Larry Schumaker and others, and we thought that smoothing splines would be of interest to statisticians. The smoothing spline of order m was the solution to: find f in the space of functions with square integrable mth derivative to minimize

$$\sum_{i=1}^n \{y_i - f(t_i)\}^2 + \lambda \int_0^1 \{f^{(m)}(t)\}^2\, dt, \qquad (41.1)$$

where t_1,...,t_n ∈ [0, 1]. Professor Schoenberg many years ago had characterized the solution to this problem as a piecewise polynomial of degree 2m − 1 satisfying some boundary and continuity conditions.

Our ah-ha moment came when we observed that the space of functions with square integrable mth derivative on [0, 1] was an RKHS with seminorm ‖Pf‖ defined by

$$\|Pf\|^2 = \int_0^1 \{f^{(m)}(t)\}^2\, dt$$

and with an associated K(s, t) that we could figure out. (A seminorm is exactly like a norm except that it has a non-trivial null space; here the null space of this seminorm is the span of the polynomials of degree m − 1 or less.) Then by replacing f(t) by ⟨K_t, f⟩ it was not hard to show by a very simple geometric argument that the minimizer of (41.1) was in the span of the K_{t_1},...,K_{t_n} and a basis for the null space of the seminorm.


But furthermore, the very same geometrical argument could be used to solve the more general problem: find f ∈ H_K, an RKHS, to minimize

$$\sum_{i=1}^n C(y_i, L_i f) + \lambda \|Pf\|_K^2, \qquad (41.2)$$

where C(y_i, L_i f) is convex in L_i f, with L_i a bounded linear functional in H_K and ‖Pf‖²_K a seminorm in H_K. A bounded linear functional is a linear functional with a representer in H_K, i.e., there exists η_i ∈ H_K such that L_i f = ⟨η_i, f⟩ for all f ∈ H_K. The minimizer of (41.2) is in the span of the representers η_i and a basis for the null space of the seminorm. That is known as the representer theorem, which turned out to be a key to fitting (mostly continuous) functions in an infinite-dimensional space, given a finite number of pieces of information. There were two things I remember about our excitement over the result: one of us, I'm pretty sure it was George, thought the result was too trivial and not worthwhile to submit, but submit it we did and it was accepted (Kimeldorf and Wahba, 1971) without a single complaint, within three weeks. I have never since then had another paper accepted by a refereed journal within three weeks and without a single complaint. Advice: if you think it is worthwhile, submit it.

41.1.2 Svante Wold and leaving-out-one

Following Kimeldorf and Wahba, it was clear that for practical use, a method was needed to choose the smoothing or tuning parameter λ in (41.1). The natural goal was to minimize the mean square error over the function f, for which its values at the data points would be the proxy. In 1974 Svante Wold visited Madison, and we got to mulling over how to choose λ. It so happened that Mervyn Stone gave a colloquium talk in Madison, and Svante and I were sitting next to each other as Mervyn described using leaving-out-one to decide on the degree of a polynomial to be used in least squares regression. We looked at each other at that very minute and simultaneously said something; I think it was "ah-ha," but possibly "Eureka." In those days computer time was $600/hour and Svante wrote a computer program to demonstrate that leaving-out-one did a good job. It took the entire Statistics department's computer money for an entire month to get the results in Wahba and Wold (1975). Advice: go to the colloquia, sit next to your research pals.
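To make the representer theorem concrete, here is a minimal sketch of (41.2) with squared-error loss, evaluation functionals L_i f = f(t_i), and a full RKHS norm in place of the seminorm (so there is no null space to carry along); λ is then chosen by leaving-out-one, using the standard shortcut for linear smoothers. The Gram matrix K is assumed to be supplied, for instance by the gram_matrix sketch above.

```python
import numpy as np

def fit_and_loo(K, y, lam):
    """Representer-theorem fit for sum_i {y_i - f(t_i)}^2 + lam * ||f||_K^2:
    f = sum_j c_j K(t_j, .), with c = (K + lam I)^{-1} y, so the fitted values
    are A(lam) y with influence matrix A(lam) = K (K + lam I)^{-1}.
    Returns the fit and the leaving-out-one score, computed with the usual
    linear-smoother shortcut (y_i - fhat_i) / (1 - A_ii)."""
    n = len(y)
    A = K @ np.linalg.solve(K + lam * np.eye(n), np.eye(n))
    fhat = A @ y
    loo_resid = (y - fhat) / (1.0 - np.diag(A))
    return fhat, np.mean(loo_resid ** 2)

def choose_lambda(K, y, lambdas):
    """Pick the tuning parameter minimizing the leaving-out-one score."""
    scores = [fit_and_loo(K, y, lam)[1] for lam in lambdas]
    return lambdas[int(np.argmin(scores))]
```

A typical call would search a logarithmic grid, e.g. choose_lambda(K, y, np.logspace(-6, 1, 30)).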


41.1.3 Peter Craven, Gene Golub and Michael Heath and GCV

After much struggle to prove some optimality properties of leaving-out-one, it became clear that it couldn't be done in general. Considering the data model y = f + ε, where y = (y_1,...,y_n)^⊤, f = (f(t_1),...,f(t_n))^⊤ and ε = (ε_1,...,ε_n)^⊤ is a zero mean i.i.d. Gaussian random vector, the information in the data is unchanged by multiplying left and right hand side by an orthogonal matrix, since Γε with Γ orthogonal is still white Gaussian noise. But leaving-out-one can give you a different answer. To explain, we define the influence matrix: let f_λ be the minimizer of (41.1) when C is sum of squares. The influence matrix relates the data to the prediction of the data, f_λ = A(λ)y, where f_λ = (f_λ(t_1),...,f_λ(t_n)). A heuristic argument fell out of the blue, probably in an attempt to explain some things to students, that rotating the data so that the influence matrix was constant down the diagonal was the trick. The result was that instead of leaving-out-one, one should minimize the GCV function

$$V(\lambda) = \frac{\sum_{i=1}^n \{y_i - f_\lambda(t_i)\}^2}{[\mathrm{trace}\{I - A(\lambda)\}]^2}$$

(Craven and Wahba, 1979; Golub et al., 1979). I was on sabbatical at Oxford in 1975 and Gene was at ETH visiting Peter Huber, who had a beautiful house in Klosters, the fabled ski resort. Peter invited Gene and me up for the weekend, and Gene just wrote out the algorithm in Golub et al. (1979) on the train from Zürich to Klosters while I snuck glances at the spectacular scenery. Gene was a much loved mentor to lots of people. He was born on February 29, 1932 and died on November 16, 2007. On February 29 and March 1, 2008 his many friends held memorial birthday services at Stanford and 30 other locations around the world. Ker-Chau Li (Li, 1985, 1986, 1987) and others later proved optimality properties of the GCV, and popular codes in R will compute splines and other fits using GCV to estimate λ and other important tuning parameters. Advice: pay attention to important tuning parameters since the results can be very sensitive to them. Advice: appreciate mentors like Gene if you are lucky enough to have such great mentors.

41.1.4 Didier Girard, Mike Hutchinson, randomized trace and the degrees of freedom for signal

Brute force calculation of the trace of the influence matrix A(λ) can be daunting to compute directly for large n. Let f^y_λ be the minimizer of (41.1) with the data vector y and let f^{y+δ}_λ be the minimizer of (41.1) given the perturbed data y + δ. Note that

$$\delta^\top (f^{y+\delta}_\lambda - f^{y}_\lambda) = \delta^\top \{A(\lambda)(y+\delta) - A(\lambda)y\} = \sum_{i,j=1}^n \delta_i \delta_j a_{ij},$$

where δ_i and a_ij are the components of δ and A(λ), respectively. If the perturbations are i.i.d. with variance 1, then this sum is an estimate of trace A(λ). This simple idea was proposed in Girard (1989) and Hutchinson (1989), with further theory in Girard (1991). It was a big ah-ha when I saw these papers because further applications were immediate.
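Both the GCV function and the Girard–Hutchinson randomized trace are easy to sketch for a generic linear smoother. The code below assumes access either to the influence matrix A(λ) itself or to a black-box smoother y ↦ A(λ)y; ±1 perturbations are used as one convenient choice of i.i.d., variance-one probes.

```python
import numpy as np

def gcv_score(A, y):
    """V(lambda) = sum_i {y_i - f_lambda(t_i)}^2 / [trace{I - A(lambda)}]^2
    for a linear smoother with influence matrix A(lambda)."""
    resid = y - A @ y
    return np.sum(resid ** 2) / (len(y) - np.trace(A)) ** 2

def randomized_trace(smoother, y, n_probes=20, rng=None):
    """Estimate trace A(lambda), the degrees of freedom for signal, without
    forming A(lambda): for a linear smoother,
    delta^T {A(lambda)(y + delta) - A(lambda) y} = delta^T A(lambda) delta,
    which has expectation trace A(lambda) when the delta_i are i.i.d. with
    variance 1."""
    rng = np.random.default_rng(rng)
    f_y = smoother(y)
    draws = []
    for _ in range(n_probes):
        delta = rng.choice([-1.0, 1.0], size=len(y))
        draws.append(delta @ (smoother(y + delta) - f_y))
    return float(np.mean(draws))
```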


In Wahba (1983), p. 139, I defined the trace of A(λ) as the "equivalent degrees of freedom for signal," by analogy with linear least squares regression with p < n predictors, where the influence matrix is a rank p projection operator. The degrees of freedom for signal is an important concept in linear and nonlinear nonparametric regression, and it was a mistake to hide it inconspicuously in Wahba (1983). Later Brad Efron (2004) gave an alternative definition of degrees of freedom for signal. The definition in Wahba (1983) depends only on the data; Efron's is essentially an expected value. Note that in (41.1),

$$\mathrm{trace}\{A(\lambda)\} = \sum_{i=1}^n \frac{\partial \hat y_i}{\partial y_i},$$

where ŷ_i is the predicted value of y_i. This definition can reasonably be applied to a problem with a nonlinear forward operator (that is, one that maps data onto the predicted data) when the derivatives exist, and the randomized trace method is reasonable for estimating the degrees of freedom for signal, although care should be taken concerning the size of δ. Even when the derivatives don't exist the randomized trace can be a reasonable way of getting at the degrees of freedom for signal; see, e.g., Wahba et al. (1995).

41.1.5 Yuedong Wang, Chong Gu and smoothing spline ANOVA

Sometime in the late 80s or early 90s I heard Graham Wilkinson expound on ANOVA (Analysis of Variance), where data was given on a regular d-dimensional grid, viz.

$$y_{ijk}, \quad t_{ijk}, \quad i = 1,\dots,I, \; j = 1,\dots,J, \; k = 1,\dots,K,$$

for d = 3, and so forth. That is, the domain is the Cartesian product of several one-dimensional grids. Graham was expounding on how fitting a model from observations on such a domain could be described as a set of orthogonal projections based on averaging operators, resulting in main effects, two factor interactions, etc. "Ah-ha," I thought, we should be able to do exactly the same thing and more where the domain is the Cartesian product T = T_1 ⊗ ··· ⊗ T_d of d arbitrary domains. We want to fit functions on T, with main effects (functions of one variable), two factor interactions (functions of two variables), and possibly more terms, given scattered observations, and we just need to define averaging operators for each T_α. Brainstorming with Yuedong Wang and Chong Gu fleshed out the results.

Let H_α, α = 1,...,d, be d RKHSs with domains T_α, each H_α containing the constant functions. H = H_1 ⊗ ··· ⊗ H_d is an RKHS with domain T.


For each α = 1,...,d, construct a probability measure dµ_α on T_α, with the property that the symbol (E_α f)(t), the averaging operator, defined by

$$(\mathcal{E}_\alpha f)(t) = \int_{\mathcal{T}^{(\alpha)}} f(t_1,\dots,t_d)\, d\mu_\alpha(t_\alpha),$$

is well defined and finite for every f ∈ H and t ∈ T. Consider the decomposition of the identity operator:

$$I = \prod_\alpha \{\mathcal{E}_\alpha + (I - \mathcal{E}_\alpha)\} = \prod_\alpha \mathcal{E}_\alpha + \sum_\alpha (I - \mathcal{E}_\alpha)\prod_{\beta \neq \alpha}\mathcal{E}_\beta + \sum_{\alpha<\beta}(I - \mathcal{E}_\alpha)(I - \mathcal{E}_\beta)\prod_{\gamma \neq \alpha,\beta}\mathcal{E}_\gamma + \cdots + \prod_\alpha (I - \mathcal{E}_\alpha). \qquad (41.3)$$
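A d = 2 instance of (41.3) can be checked numerically when the averaging operators are taken to be plain means over a regular grid (one particular, convenient choice of dµ_α); the sketch below tabulates f on a grid and recovers the constant, the two main effects, and the two-factor interaction.

```python
import numpy as np

def ss_anova_grid_2d(F):
    """ANOVA-type decomposition of F[i, j] = f(t1_i, t2_j) on a grid, using
    equal-weight averaging operators E1 (mean over t1) and E2 (mean over t2):
      constant     = E1 E2 f
      main effects = (I - E1) E2 f   and   E1 (I - E2) f
      interaction  = (I - E1)(I - E2) f
    The four pieces add back to F exactly."""
    const = F.mean()
    main1 = F.mean(axis=1) - const          # function of t1 only
    main2 = F.mean(axis=0) - const          # function of t2 only
    interaction = F - const - main1[:, None] - main2[None, :]
    return const, main1, main2, interaction

rng = np.random.default_rng(0)
F = rng.normal(size=(6, 8))
c, m1, m2, m12 = ss_anova_grid_2d(F)
assert np.allclose(F, c + m1[:, None] + m2[None, :] + m12)
```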


as proposed by Vapnik and coworkers (Vapnik, 1995) was derived from an argument nothing like what I am about to give. Somewhere during Vladimir's talk, an unknown voice towards the back of the audience called out "That looks like Grace Wahba's stuff." It looked obvious that the SVM as proposed by Vapnik with the "kernel trick" could be obtained as the solution to the optimization problem of (41.2) with C(y_i, L_i f) replaced by the so-called hinge function, (1 − y_i f(t_i))_+, where (τ)_+ = τ if τ > 0 and 0 otherwise. Each data point is coded as ±1 according as it came from the "plus" class or the "minus" class. For technical reasons the null space of the penalty function consists at most of the constant functions. Thus it follows that the solution is in the span of the representers K_{t_i} from the chosen RKHS plus possibly a constant function. Yi Lin and coworkers (Lin et al., 2002a,b) showed that the SVM was estimating the sign of the log odds ratio, just what is needed for two class classification. The SVM may be compared to the case where one desires to estimate the probability that an object is in the plus class. If one begins with the penalized log likelihood of the Bernoulli distribution and codes the data as ±1 instead of the usual coding as 0 or 1, then we have the same optimization problem with C(y_i, f(t_i)) = ln{1 + e^{−y_i f(t_i)}} instead of (1 − y_i f(t_i))_+, with solution in the same finite dimensional space, but it is estimating the log odds ratio, as opposed to the sign of the log odds ratio. It was actually a big deal that the SVM could be directly compared with penalized likelihood with Bernoulli data, and it provided a pathway for statisticians and computer scientists to breach a major divide between them on the subject of classification, and to understand each others' work.

For many years before the Hadley meeting, Olvi Mangasarian and I would talk about what we were doing in classification, neither of us having any understanding of what the other was doing. Olvi complained that the statisticians dismissed his work, but it turned out that what he was doing was related to the SVM and hence perfectly legitimate, not to mention interesting, from a classical statistical point of view. Statisticians and computer scientists have been on the same page on classification ever since.

It is curious to note that several patents have been awarded for the SVM. One of the early ones, issued on July 15, 1997, is "5649068 Pattern recognition system using support vectors." I'm guessing that the unknown volunteer was David Donoho. Advice: keep your eyes open to synergies between apparently disparate fields.
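The comparison between the hinge function and the penalized Bernoulli log likelihood is easy to see numerically; the sketch below simply evaluates the two choices of C(y_i, f(t_i)) for ±1-coded data. It is not an SVM solver, only an illustration of the two loss functions being contrasted here.

```python
import numpy as np

def hinge(y, f):
    """SVM loss (1 - y f)_+ for labels y coded as +1 or -1."""
    return np.maximum(0.0, 1.0 - y * f)

def bernoulli(y, f):
    """Penalized-likelihood loss ln(1 + exp(-y f)) for the same coding;
    its minimizer estimates the log odds ratio rather than just its sign."""
    return np.logaddexp(0.0, -y * f)   # numerically stable ln(1 + e^{-yf})

f_grid = np.linspace(-3.0, 3.0, 7)
print(np.round(hinge(+1, f_grid), 3))
print(np.round(bernoulli(+1, f_grid), 3))
```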


41.1.7 Yoonkyung Lee, Yi Lin and the multi-category SVM

For classification, when one has k > 2 classes it is always possible to apply an SVM to compare membership in one class versus the rest of the k classes, running through the algorithm k times. In the early 2000s there were many papers on one-vs-rest, and designs for subsets vs. other subsets, but it is possible to generate examples where essentially no observations will be identified as being in certain classes. Since one-vs-rest could fail in certain circumstances, it was something of an open question how to do multi-category SVMs in one optimization problem that did not have this problem. Yi Lin, Yoonkyung Lee and I were sitting around shooting the breeze and one of us said "how about a sum-to-zero constraint?" and the other two said "ah-ha," or at least that's the way I remember it. The idea is to code the labels as k-vectors, with a 1 in the rth position and −1/(k − 1) in the k − 1 other positions for a training sample in class r. Thus, each observation vector satisfies the sum-to-zero constraint. The idea was to fit a vector of functions satisfying the same sum-to-zero constraint. The multi-category SVM fit estimates f(t) = (f_1(t),...,f_k(t)), t ∈ T, subject to the sum-to-zero constraint everywhere, and the classification for a subject with attribute vector t is just the index of the largest component of the estimate of f(t). See Lee and Lee (2003) and Lee et al. (2004a,b). Advice: shooting the breeze is good.

41.1.8 Fan Lu, Steve Wright, Sunduz Keles, Hector Corrada Bravo, and dissimilarity information

We return to the alternative role of positive definite functions as a way to encode pairwise distance observations. Suppose we are examining n objects O_1,...,O_n and are given some noisy or crude observations on their pairwise distances/dissimilarities, which may not satisfy the triangle inequality. The goal is to embed these objects in a Euclidean space in such a way as to respect the pairwise dissimilarities as much as possible. Positive definite matrices encode pairwise squared distances d_ij between O_i and O_j as

$$d_{ij}(K) = K(i,i) + K(j,j) - 2K(i,j), \qquad (41.5)$$

and, given a non-negative definite matrix of rank d ≤ n, can be used to embed the n objects in a Euclidean space of dimension d, centered at 0 and unique up to rotations. We seek a K which respects the dissimilarity information d^obs_ij while constraining the complexity of K by

$$\min_{K \in S_n} \sum |d^{\mathrm{obs}}_{ij} - d_{ij}(K)| + \lambda\, \mathrm{trace}(K), \qquad (41.6)$$

where S_n is the convex cone of symmetric positive definite matrices. I looked at this problem for an inordinate amount of time seeking an analytic solution, but after a conversation with Vishy (S.V.N. Vishwanathan) at a meeting in Rotterdam in August of 2003 I realized it wasn't going to happen. The ah-ha moment came about when I showed the problem to Steve Wright, who right off said it could be solved numerically using recently developed convex cone software. The result so far is Corrada Bravo et al. (2009) and Lu et al. (2005). In Lu et al. (2005) the objects are protein sequences and the pairwise distances are BLAST scores. The fitted kernel K had three eigenvalues that contained about 95% of the trace, so we reduced K to a rank 3 matrix by truncating the smaller eigenvalues.
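The passage from a fitted kernel to coordinates is a small eigen-decomposition exercise; here is a sketch of the squared distances (41.5) and of the rank-d truncation used for the protein example. It assumes a symmetric, nonnegative definite K is already in hand (for instance, the output of the convex-cone fit (41.6), which is not reproduced here).

```python
import numpy as np

def kernel_distances(K):
    """Pairwise squared distances d_ij(K) = K_ii + K_jj - 2 K_ij of (41.5)."""
    diag = np.diag(K)
    return diag[:, None] + diag[None, :] - 2.0 * K

def embed(K, d=3):
    """Coordinates in R^d from the top d eigenpairs of K: rows of
    V_d diag(sqrt(lambda_d)).  With d equal to the rank of K these
    coordinates reproduce d_ij(K) exactly; truncating to a small d (three,
    in the protein example) gives the reduced-rank embedding used for
    plotting clusters."""
    vals, vecs = np.linalg.eigh(K)
    top = np.argsort(vals)[::-1][:d]
    return vecs[:, top] * np.sqrt(np.clip(vals[top], 0.0, None))
```

As a quick check, the squared Euclidean distances between the rows of embed(K, d) approach kernel_distances(K) as d grows toward the rank of K.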


Clusters of four different kinds of proteins were readily separated visually in three-d plots; see Lu et al. (2005) for the details. In Corrada Bravo et al. (2009) the objects are persons in pedigrees in a demographic study and the distances are based on Malecot's kinship coefficient, which defines a pedigree dissimilarity measure. The resulting kernel became part of an SS ANOVA model with other attributes of persons, and the model estimates a risk related to an eye disease. Advice: find computer scientist friends.

41.1.9 Gábor Székely, Maria Rizzo, Jing Kong and distance correlation

The last ah-ha experience that we report is similar to that involving the randomized trace estimate of Section 41.1.4, i.e., the ah-ha moment came about upon realizing that a particular recent result was very relevant to what we were doing. In this case Jing Kong brought to my attention the important paper of Gábor Székely and Maria Rizzo (Székely and Rizzo, 2009). Briefly, this paper considers the joint distribution of two random vectors, X and Y, say, and provides a test, called distance correlation, of whether it factors, that is, whether the two random vectors are independent. Starting with n observations from the joint distribution, let {A_ij} be the collection of double-centered pairwise distances among the n(n−1)/2 pairs of X observations, and similarly for {B_ij}. The statistic, called distance correlation, is the analogue of the usual sample correlation between the A's and B's. The special property of the test is that it is justified for X and Y in Euclidean p and q space for arbitrary p and q with no further distributional assumptions. In a demographic study involving pedigrees (Kong et al., 2012), we observed that pairwise distance in death age between close relatives was less than that of unrelated age cohorts. A mortality risk score for four lifestyle factors and another score for a group of diseases was developed via SS ANOVA modeling, and significant distance correlation was found between death ages, lifestyle factors and family relationships, raising more questions than it answers regarding the "Nature-Nurture" debate (relative role of genetics and other attributes).

We take this opportunity to make a few important remarks about pairwise distances/dissimilarities, primarily that how one measures them can be important, and getting the "right" dissimilarity can be 90% of the problem. We remark that family relationships in Kong et al. (2012) were based on a monotone function of Malecot's kinship coefficient that was different from the monotone function in Corrada Bravo et al. (2009). Here it was chosen to fit in with the different way the distances were used. In (41.6), the pairwise dissimilarities can be noisy, scattered, incomplete and could include subjective distances like "very close, close, ..." etc., not even satisfying the triangle inequality. So there is substantial flexibility in choosing the dissimilarity measure with respect to the particular scientific context of the problem. In Kong et al. (2012) the pairwise distances need to be a complete set, and be Euclidean (with some specific metric exceptions).
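The sample statistic itself is short to write down. The sketch below follows the description above (double-center each matrix of pairwise Euclidean distances, then form the correlation-like ratio); it takes X as an n × p array and Y as an n × q array and is only meant to illustrate the computation, not to replace existing implementations.

```python
import numpy as np

def _double_center(Z):
    """Pairwise Euclidean distances between rows of Z, double centered:
    A_ij = a_ij - rowmean_i - colmean_j + grandmean."""
    sq = np.sum(Z * Z, axis=1)
    D = np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T, 0.0))
    return D - D.mean(axis=1, keepdims=True) - D.mean(axis=0, keepdims=True) + D.mean()

def distance_correlation(X, Y):
    """Sample distance correlation between X (n x p) and Y (n x q): the
    analogue of the sample correlation between the A's and the B's."""
    A = _double_center(np.asarray(X, dtype=float))
    B = _double_center(np.asarray(Y, dtype=float))
    dcov2 = (A * B).mean()
    denom = np.sqrt((A * A).mean() * (B * B).mean())
    return 0.0 if denom == 0 else np.sqrt(max(dcov2, 0.0) / denom)
```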


There is still substantial choice in choosing the definition of distance, since any linear transformation of a Euclidean coordinate system defines a Euclidean distance measure. Advice: think about how you measure distance or dissimilarity in any problem involving pairwise relationships; it can be important.

41.2 Regularization methods, RKHS and sparse models

The optimization problems in RKHS are a rich subclass of what can be called regularization methods, which solve an optimization problem which trades fit to the data versus complexity or constraints on the solution. My first encounter with the term "regularization" was Tikhonov (1963), in the context of finding numerical solutions to integral equations. There the L_i of (41.2) were noisy integrals of an unknown function one wishes to reconstruct, but the observations only contained a limited amount of information regarding the unknown function. The basic and possibly revolutionary idea at the time was to find a solution which involves fit to the data while constraining the solution by what amounted to an RKHS seminorm, ∫{f''(t)}² dt, standing in for the missing information by an assumption that the solution was "smooth" (O'Sullivan, 1986; Wahba, 1977). Where once RKHS were a niche subject, they are now a major component of the statistical model building/machine learning literature.

However, RKHS do not generally provide sparse models, that is, models where a large number of coefficients are being estimated but only a small but unknown number are believed to be non-zero. Many problems in the "Big Data" paradigm are believed to have, or want to have, sparse solutions; for example, genetic data vectors may have many thousands of components and a modest number of subjects, as in a case-control study. The most popular method for ensuring sparsity is probably the lasso (Chen et al., 1998; Tibshirani, 1996). Here a very large dictionary of basis functions (B_j(t), j = 1, 2,...) is given and the unknown function is estimated as f(t) = Σ_j β_j B_j(t), with the penalty functional λ Σ_j |β_j| replacing an RKHS square norm. This will induce many zeroes in the β_j, depending, among other things, on the size of λ. Since then, researchers have commented that there is a "zoo" of proposed variants of sparsity-inducing penalties, many involving assumptions on structures in the data; one popular example is Yuan and Lin (2006). Other recent models involve mixtures of RKHS and sparsity-inducing penalty functionals. One of our contributions to this "zoo" deals with the situation where the data vectors amount to very large "bar codes," and it is desired to find patterns in the bar codes relevant to some outcome. An innovative algorithm which deals with a humongous number of interacting patterns, assuming that only a small number of coefficients are non-zero, is given in Shi et al. (2012), Shi et al. (2008) and Wright (2012).


As is easy to see here and in the statistical literature, the statistical modeler has overwhelming choices in modeling tools, with many public codes available in the software repository R and elsewhere. In practice these choices must be made with a serious understanding of the science and the issues motivating the data collection. Good collaborations with subject matter researchers can lead to the opportunity to participate in real contributions to the science. Advice: learn absolutely as much as you can about the subject matter of the data that you contemplate analyzing. When you use "black boxes" be sure you know what is inside them.

41.3 Remarks on the nature-nurture debate, personalized medicine and scientific literacy

We and many other researchers have been developing methods for combining scattered, noisy, incomplete, highly heterogeneous information from multiple sources with interacting variables to predict, classify, and determine patterns of attributes relevant to a response, or more generally multiple correlated responses.

Demographic studies, clinical trials, and ad hoc observational studies based on electronic medical records, which have familial (Corrada Bravo et al., 2009; Kong et al., 2012), clinical, genetic, lifestyle, treatment and other attributes, can be a rich source of information regarding the Nature-Nurture debate, as well as informing Personalized Medicine, two popular areas reflecting much activity. As large medical systems put their records in electronic form, interesting problems arise as to how to deal with such unstructured data, to relate subject attributes to outcomes of interest. No doubt a gold mine of information is there, particularly with respect to how the various attributes interact. The statistical modeling/machine learning community continues to create and improve tools to deal with this data flood, eager to develop better and more efficient modeling methods, and regularization and dissimilarity methods will no doubt continue to play an important role in numerous areas of scientific endeavor. With regard to human subjects studies, a limitation is the problem of patient confidentiality: the more attribute information available to explore for its relevance, the trickier the privacy issues, to the extent that de-identified data can actually be identified. It is important, however, that statisticians be involved from the very start in the design of human subjects studies.

With health related research, the US citizenry has some appreciation of scientific results that can lead to better health outcomes. On the other hand, any scientist who reads the newspapers or follows present day US politics is painfully aware that a non-trivial portion of voters and the officials they elect have little or no understanding of the scientific method.


Statisticians need to participate in the promotion of increased scientific literacy in our educational establishment at all levels.

41.4 Conclusion

In response to the invitation from COPSS to contribute to their 50th Anniversary Celebration, I have taken a tour of some exciting moments in my career, involving RKHS and regularization methods, pairwise dissimilarities and distances, and lasso models, dispensing un-asked-for advice to new researchers along the way. I have made a few remarks concerning the richness of models based on RKHS, as well as models involving sparsity-inducing penalties, with some remarks involving the Nature-Nurture Debate and Personalized Medicine. I end this contribution with thanks to my many coauthors, identified here or not, and to my terrific present and former students. Advice: Treasure your collaborators! Have great students!

Acknowledgements

This work was supported in part by the National Science Foundation Grant DMS 0906818 and the National Institutes of Health Grant R01 EY09946. The author thanks Didier Girard for pointing out a mistake and also noting that his 1987 report entitled "Un algorithme simple et rapide pour la validation croisée généralisée sur des problèmes de grande taille" (Rapport RR669-M, Informatique et mathématiques appliquées de Grenoble, France) predates the reference Girard (1989) given here.

References

Aronszajn, N. (1950). Theory of reproducing kernels. Transactions of the American Mathematical Society, 68:337–404.
Chen, S., Donoho, D., and Saunders, M. (1998). Atomic decomposition by basis pursuit. SIAM Journal of Scientific Computing, 20:33–61.


Corrada Bravo, H., Lee, K.E., Klein, B.E., Klein, R., Iyengar, S.K., and Wahba, G. (2009). Examining the relative influence of familial, genetic and environmental covariate information in flexible risk models. Proceedings of the National Academy of Sciences, 106:8128–8133. Open source at www.pnas.org/content/106/20/8128.full.pdf+html, PMCID: PMC2677979.
Craven, P. and Wahba, G. (1979). Smoothing noisy data with spline functions: Estimating the correct degree of smoothing by the method of generalized cross-validation. Numerische Mathematik, 31:377–403.
Efron, B. (2004). The estimation of prediction error: Covariance penalties and cross-validation. Journal of the American Statistical Association, 99:619–632.
Girard, D. (1989). A fast "Monte Carlo cross-validation" procedure for large least squares problems with noisy data. Numerische Mathematik, 56:1–23.
Girard, D. (1991). Asymptotic optimality of the fast randomized versions of GCV and C_L in ridge regression and regularization. The Annals of Statistics, 19:1950–1963.
Golub, G., Heath, M., and Wahba, G. (1979). Generalized cross validation as a method for choosing a good ridge parameter. Technometrics, 21:215–224.
Gu, C. (2002). Smoothing Spline ANOVA Models. Springer, New York.
Gu, C. and Wahba, G. (1993). Smoothing spline ANOVA with componentwise Bayesian "confidence intervals." Journal of Computational and Graphical Statistics, 2:97–117.
Hutchinson, M. (1989). A stochastic estimator for the trace of the influence matrix for Laplacian smoothing splines. Communications in Statistics – Simulations, 18:1059–1076.
Kimeldorf, G. and Wahba, G. (1971). Some results on Tchebycheffian spline functions. Journal of Mathematical Analysis and its Applications, 33:82–95.
Kong, J., Klein, B.E., Klein, R., Lee, K.E., and Wahba, G. (2012). Using distance correlation and smoothing spline ANOVA to assess associations of familial relationships, lifestyle factors, diseases and mortality. PNAS, pp. 20353–20357.
Lee, Y. and Lee, C.-K. (2003). Classification of multiple cancer types by multi-category support vector machines using gene expression data. Bioinformatics, 19:1132–1139.
Lee, Y., Lin, Y., and Wahba, G. (2004). Multicategory support vector machines, theory, and application to the classification of microarray data and satellite radiance data. Journal of the American Statistical Association, 99:67–81.


Lee, Y., Wahba, G., and Ackerman, S. (2004). Classification of satellite radiance data by multicategory support vector machines. Journal of Atmospheric and Oceanic Technology, 21:159–169.
Li, K.C. (1985). From Stein's unbiased risk estimates to the method of generalized cross-validation. The Annals of Statistics, 13:1352–1377.
Li, K.C. (1986). Asymptotic optimality of C_L and generalized cross validation in ridge regression with application to spline smoothing. The Annals of Statistics, 14:1101–1112.
Li, K.C. (1987). Asymptotic optimality for C_p, C_L, cross-validation and generalized cross validation: Discrete index set. The Annals of Statistics, 15:958–975.
Lin, Y., Lee, Y., and Wahba, G. (2002). Support vector machines for classification in nonstandard situations. Machine Learning, 46:191–202.
Lin, Y., Wahba, G., Zhang, H., and Lee, Y. (2002). Statistical properties and adaptive tuning of support vector machines. Machine Learning, 48:115–136.
Lu, F., Keleş, S., Wright, S.J., and Wahba, G. (2005). A framework for kernel regularization with application to protein clustering. Proceedings of the National Academy of Sciences, 102:12332–12337. Open source at www.pnas.org/content/102/35/12332, PMCID: PMC118947.
O'Sullivan, F. (1986). A statistical perspective on ill-posed inverse problems. Statistical Science, 1:502–527.
Parzen, E. (1962). An approach to time series analysis. The Annals of Mathematical Statistics, 32:951–989.
Shi, W., Wahba, G., Irizarry, R.A., Corrada Bravo, H., and Wright, S.J. (2012). The partitioned LASSO-patternsearch algorithm with application to gene expression data. BMC Bioinformatics, 13:98.
Shi, W., Wahba, G., Wright, S.J., Lee, K., Klein, R., and Klein, B. (2008). LASSO-patternsearch algorithm with application to ophthalmology and genomic data. Statistics and its Interface, 1:137–153.
Székely, G. and Rizzo, M. (2009). Brownian distance covariance. The Annals of Applied Statistics, 3:1236–1265.
Tibshirani, R.J. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58:267–288.
Tikhonov, A. (1963). Solution of incorrectly formulated problems and the regularization method. Soviet Mathematics Doklady, 4:1035–1038.


Vapnik, V. (1995). The Nature of Statistical Learning Theory. Springer, Berlin.
Wahba, G. (1977). Practical approximate solutions to linear operator equations when the data are noisy. SIAM Journal of Numerical Analysis, 14:651–667.
Wahba, G. (1983). Bayesian "confidence intervals" for the cross-validated smoothing spline. Journal of the Royal Statistical Society, Series B, 45:133–150.
Wahba, G. (1990). Spline Models for Observational Data. SIAM, CBMS-NSF Regional Conference Series in Applied Mathematics, vol. 59.
Wahba, G., Johnson, D.R., Gao, F., and Gong, J. (1995). Adaptive tuning of numerical weather prediction models: Randomized GCV in three and four dimensional data assimilation. Monthly Weather Review, 123:3358–3369.
Wahba, G. and Wold, S. (1975). A completely automatic French curve. Communications in Statistics, 4:1–17.
Wang, Y. (2011). Smoothing Splines: Methods and Applications. Chapman & Hall, London.
Wright, S.J. (2012). Accelerated block-coordinate relaxation for regularized optimization. SIAM Journal of Optimization, 22:159–186. Preprint and software available at http://pages.cs.wisc.edu/~swright/LPS/.
Yuan, M. and Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B, 68:49–67.


42
In praise of sparsity and convexity

Robert J. Tibshirani
Department of Statistics, Stanford University, Stanford, CA

To celebrate the 50th anniversary of the COPSS, I present examples of exciting developments of sparsity and convexity, in statistical research and practice.

42.1 Introduction

When asked to reflect on an anniversary of their field, scientists in most fields would sing the praises of their subject. As a statistician, I will do the same. However, here the praise is justified! Statistics is a thriving discipline, more and more an essential part of science, business and societal activities. Class enrollments are up (it seems that everyone wants to be a statistician) and there are jobs everywhere. The field of machine learning, discussed in this volume by my friend Larry Wasserman, has exploded and brought along with it the computational side of statistical research. Hal Varian, Chief Economist at Google, said "I keep saying that the sexy job in the next 10 years will be statisticians. And I'm not kidding." Nate Silver, creator of the New York Times political forecasting blog "538," was constantly in the news and on talk shows in the runup to the 2012 US election. Using careful statistical modelling, he forecasted the election with near 100% accuracy (in contrast to many others). Although his training is in economics, he (proudly?) calls himself a statistician. When meeting people at a party, the label "Statistician" used to kill one's chances of making a new friend. But no longer!

In the midst of all this excitement about the growing importance of statistics, there are fascinating developments within the field itself. Here I will discuss one that has been the focus of my research and that of many other statisticians.


FIGURE 42.1
Estimation picture for the lasso (left) and ridge regression (right). Shown are contours of the error and constraint functions. The solid areas are the constraint regions |β_1| + |β_2| ≤ t and β_1² + β_2² ≤ t², respectively, while the ellipses are the contours of the least squares error function. The sharp corners of the constraint region for the lasso yield sparse solutions. In high dimensions, sparsity arises from corners and edges of the constraint region.

42.2 Sparsity, convexity and l_1 penalties

One of the earliest proposals for using l_1 or absolute-value penalties was the lasso method for penalized regression. Given a linear regression with predictors x_ij and response values y_i for i ∈ {1,...,N} and j ∈ {1,...,p}, the lasso solves the l_1-penalized regression

$$\operatorname*{minimize}_{\beta}\;\left\{\frac{1}{2}\sum_{i=1}^N \Bigl(y_i - \sum_{j=1}^p x_{ij}\beta_j\Bigr)^{2} + \lambda \sum_{j=1}^p |\beta_j|\right\}.$$

This is equivalent to minimizing the sum of squares with constraint |β_1| + ··· + |β_p| ≤ s. It is similar to ridge regression, which has constraint β_1² + ··· + β_p² ≤ s. Because of the form of the l_1 penalty, the lasso does variable selection and shrinkage, while ridge regression, in contrast, only shrinks. If we consider a more general penalty of the form (|β_1|^q + ··· + |β_p|^q)^{1/q}, then the lasso uses q = 1 and ridge regression has q = 2. Subset selection emerges as q → 0, and the lasso corresponds to the smallest value of q (i.e., closest to subset selection) that yields a convex problem. Figure 42.1 gives a geometric view of the lasso and ridge regression.

The lasso and l_1 penalization have been the focus of a great deal of work recently. Table 42.1, adapted from Tibshirani (2011), gives a sample of this work.


TABLE 42.1
A sampling of generalizations of the lasso (method: authors).

Adaptive lasso: Zou (2006)
Compressive sensing: Donoho (2004), Candès (2006)
Dantzig selector: Candès and Tao (2007)
Elastic net: Zou and Hastie (2005)
Fused lasso: Tibshirani et al. (2005)
Generalized lasso: Tibshirani and Taylor (2011)
Graphical lasso: Yuan and Lin (2007b), Friedman et al. (2010)
Grouped lasso: Yuan and Lin (2007a)
Hierarchical interaction models: Bien et al. (2013)
Matrix completion: Candès and Tao (2009), Mazumder et al. (2010)
Multivariate methods: Joliffe et al. (2003), Witten et al. (2009)
Near-isotonic regression: Tibshirani et al. (2011)

The original motivation for the lasso was interpretability: it is an alternative to subset regression for obtaining a sparse model. Since that time, two unforeseen advantages of convex l_1-penalized approaches have emerged: computational and statistical efficiency. On the computational side, convexity of the problem and sparsity of the final solution can be used to great advantage. When most parameter estimates are zero in the solution, those parameters can be handled with minimal cost in the search for the solution. Powerful and scalable techniques for convex optimization can be unleashed on the problem, allowing the solution of very large problems. One particularly promising approach is coordinate descent (Fu, 1998; Friedman et al., 2007, 2010), a simple one-at-a-time method that is well-suited to the separable lasso penalty. This method is simple and flexible, and can also be applied to a wide variety of other l_1-penalized generalized linear models, including Cox's proportional hazards model for survival data. Coordinate descent is implemented in the popular glmnet package in the R statistical language, written by Jerome Friedman, Trevor Hastie, and myself, with help in the Cox feature from Noah Simon.

On the statistical side, there has also been a great deal of deep and interesting work on the mathematical aspects of the lasso, examining its ability to produce a model with minimal prediction error, and also to recover the true underlying (sparse) model. Important contributors here include Bühlmann, Candès, Donoho, Greenshtein, Johnstone, Meinshausen, Ritov, Wainwright, Yu, and many others. In describing some of this work, Hastie et al. (2001) coined the informal "Bet on Sparsity" principle. The l_1 methods assume that the truth is sparse, in some basis. If the assumption holds true, then the parameters can be efficiently estimated using l_1 penalties.
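The one-at-a-time coordinate descent updates mentioned above reduce, for the squared-error lasso, to cycling through soft-threshold steps. The sketch below is a bare-bones version of that idea, not the glmnet implementation (no warm starts, active sets, or standardization), and it assumes no all-zero predictor columns.

```python
import numpy as np

def soft_threshold(z, gamma):
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def lasso_coordinate_descent(X, y, lam, n_sweeps=200):
    """Minimize (1/2) sum_i (y_i - sum_j x_ij beta_j)^2 + lam * sum_j |beta_j|
    by cyclic coordinate descent: each coordinate update is a univariate
    soft-threshold applied to the partial residual."""
    n, p = X.shape
    beta = np.zeros(p)
    resid = y - X @ beta                 # equals y at the start
    col_ss = np.sum(X ** 2, axis=0)      # assumes no all-zero columns
    for _ in range(n_sweeps):
        for j in range(p):
            resid += X[:, j] * beta[j]   # remove coordinate j's contribution
            beta[j] = soft_threshold(X[:, j] @ resid, lam) / col_ss[j]
            resid -= X[:, j] * beta[j]   # put the updated contribution back
    return beta
```

Run over a decreasing grid of λ values, the same loop traces out a path of increasingly dense solutions.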


If the assumption does not hold, so that the truth is dense, then no method will be able to recover the underlying model without a large amount of data per parameter. This is typically not the case when p ≫ N, a commonly occurring scenario.

42.3 An example

I am currently involved in a cancer diagnosis project with researchers at Stanford University. They have collected samples of tissue from 10 patients undergoing surgery for stomach cancer. The aim is to build a classifier that can distinguish three kinds of tissue: normal epithelial, stromal and cancer. Such a classifier could be used to assist surgeons in determining, in real time, whether they had successfully removed all of the tumor. It could also yield insights into the cancer process itself. The data are in the form of images, as sketched in Figure 42.2. A pathologist has labelled each region (and hence the pixels inside a region) as epithelial, stromal or cancer. At each pixel in the image, the intensity of metabolites is measured by a kind of mass spectrometry, with the peaks in the spectrum representing different metabolites. The spectrum has been finely sampled at about 11,000 sites. Thus the task is to build a classifier to classify each pixel into one of the three classes, based on the 11,000 features. There are about 8000 pixels in all.

For this problem, I have applied an l_1-regularized multinomial model. For each class k ∈ {1, 2, 3}, the model has a vector (β_1k,...,β_pk) of parameters representing the weight given to each feature in that class. I used the glmnet package for fitting the model: it computes the entire path of solutions for all values of the regularization parameter λ, using cross-validation to estimate the best value of λ (I left one patient out at a time). The entire computation required just a few minutes on a standard Linux server.

The results so far are encouraging. The classifier shows 93–97% accuracy in the three classes, using only around 100 features. These features could yield insights about the metabolites that are important in stomach cancer. There is much more work to be done: collecting more data, and refining and testing the model. But this shows the potential of l_1-penalized models in an important and challenging scientific problem.

42.4 The covariance test

So far, most applications of the lasso and l_1 penalties seem to focus on large problems, where traditional methods like all-subsets regression can't deal with the problem computationally.


[Figure 42.2 shows the three labelled tissue regions (cancer, epithelial, stromal) and, below, the spectrum for each pixel, sampled at 11,000 m/z values.]

FIGURE 42.2
Schematic of the cancer diagnosis problem. Each pixel in each of the three regions labelled by the pathologist is analyzed by mass spectrometry. This gives a feature vector of 11,000 intensities (bottom panel), from which we try to predict the class of that pixel.

In this last section, I want to report on some very recent work that suggests that l_1 penalties may have a more fundamental role in classical mainstream statistical inference.

To begin, consider standard forward stepwise regression. This procedure enters predictors one at a time, choosing the predictor that most decreases the residual sum of squares at each stage. Defining RSS to be the residual sum of squares for the model containing j predictors and denoting by RSS_null the residual sum of squares for the model omitting the predictor k(j), we can form the usual statistic

$$R_j = (\mathrm{RSS}_{\mathrm{null}} - \mathrm{RSS})/\sigma^2$$

(with σ assumed known for now), and compare it to a χ²(1) distribution. Although this test is commonly used, we all know that it is wrong. Figure 42.3 shows an example. There are 100 observations and 10 predictors in a standard Gaussian linear model, in which all coefficients are actually zero. The left panel shows a quantile-quantile plot of 500 realizations of the statistic R_1 versus the quantiles of the χ²(1) distribution. The test is far too liberal and the reason is clear: the χ²(1) distribution is valid for comparing two fixed nested linear models. But here we are adaptively choosing the best predictor, and comparing its model fit to the null model.
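The left panel of Figure 42.3 is easy to reproduce in spirit. With an orthonormal design and all true coefficients zero, the drop in residual sum of squares for predictor j is (x_j^⊤ y)², so the adaptively chosen first step gives R_1 = max_j (x_j^⊤ y)²/σ², the maximum of p independent χ²(1) variables rather than a single one. The simulation below (seed and sizes chosen to match the example) shows how far above 5% the apparent level climbs; it is a sketch, not the code behind the figure.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, nsim, sigma2 = 100, 10, 1000, 1.0

# Orthonormal design, all true regression coefficients zero.
X, _ = np.linalg.qr(rng.normal(size=(n, p)))

R1 = np.empty(nsim)
for s in range(nsim):
    y = rng.normal(scale=np.sqrt(sigma2), size=n)
    z = X.T @ y                       # RSS drop for predictor j is z_j ** 2
    R1[s] = np.max(z ** 2) / sigma2   # forward stepwise takes the best one

# A nominal chi-squared(1) test at the 5% level uses the cutoff 3.84; the
# adaptive statistic exceeds it far more often than 5% of the time.
print(np.mean(R1 > 3.84))
```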


[Figure 42.3: two quantile-quantile panels, (a) Forward stepwise and (b) Lasso, with "Test statistic" on the vertical axes and the χ²(1) and Exp(1) quantiles on the horizontal axes.]

FIGURE 42.3
A simple example with n = 100 observations and p = 10 orthogonal predictors. All true regression coefficients are zero, β* = 0. On the left is a quantile-quantile plot, constructed over 1000 simulations, of the standard chi-squared statistic R_1, measuring the drop in residual sum of squares for the first predictor to enter in forward stepwise regression, versus the χ²(1) distribution. The dashed vertical line marks the 95% quantile of the χ²(1) distribution. The right panel shows a quantile-quantile plot of the covariance test statistic T_1 for the first predictor to enter in the lasso path, versus its asymptotic distribution Exp(1). The covariance test explicitly accounts for the adaptive nature of lasso modeling, whereas the usual chi-squared test is not appropriate for adaptively selected models, e.g., those produced by forward stepwise regression.

In fact it is difficult to correct the chi-squared test to account for adaptive selection: half-sample splitting methods can be used (Meinshausen et al., 2009; Wasserman and Roeder, 2009), but these may suffer from lower power due to the decrease in sample size.

But the lasso can help us! Specifically, we need the LAR (least angle regression) method for constructing the lasso path of solutions, as the regularization parameter λ is varied. I won't give the details of this construction here, but we just need to know that there is a special set of decreasing knots λ_1 > ··· > λ_k at which the active set of solutions (the non-zero parameter estimates) changes. When λ > λ_1, the solutions are all zero. At the point λ = λ_1, the variable most correlated with y enters the model. At each successive value λ_j, a variable enters or leaves the model, until we reach λ_k where we obtain the full least squares solution (or one such solution, if p > N).

We consider a test statistic analogous to R_j for the lasso. Let y be the vector of outcome values and X be the design matrix.


Assume for simplicity that the error variance σ² is known. Suppose that we have run LAR for j − 1 steps, yielding the active set of predictors A at λ = λ_j. Now we take one more step, entering a new predictor k(j), and producing estimates β̂(λ_j) at λ_{j+1}. We wish to test if the k(j)th component β_{k(j)} is zero. We refit the lasso, keeping λ = λ_{j+1} but using just the variables in A. This yields estimates β̂_A(λ_{j+1}). Our proposed covariance test statistic is defined by

$$T_j = \frac{1}{\sigma^2}\bigl\{\langle y, X\hat\beta(\lambda_{j+1})\rangle - \langle y, X_A \hat\beta_A(\lambda_{j+1})\rangle\bigr\}. \qquad (42.1)$$

Roughly speaking, this statistic measures how much of the covariance between the outcome and the fitted model can be attributed to the k(j)th predictor, which has just entered the model.

Now something remarkable happens. Under the null hypothesis that all signal variables are in the model: as p → ∞, T_j converges to an exponential random variable with unit mean, Exp(1). The right panel of Figure 42.3 shows the same example, using the covariance statistic. This test works for testing the first variable to enter (as in the example), or for testing noise variables after all of the signal variables have entered. And it works under quite general conditions on the design matrix. This result properly accounts for the adaptive selection: the shrinkage in the l_1 fitting counteracts the inflation due to selection, in just the right way to make the degrees of freedom (mean) of the null distribution exactly equal to 1 asymptotically. This idea can be applied to a wide variety of models, and yields honest p-values that should be useful to statistical practitioners.

In a sense, the covariance test and its exponential distribution generalize the RSS test and its chi-squared distribution, to the adaptive regression setting.

This work is very new, and is summarized in Lockhart et al. (2014). The proofs of the results are difficult, and use extreme-value theory and Gaussian processes. They suggest that the LAR knots λ_k may be fundamental in understanding the effects of adaptivity in regression. On the practical side, regression software can now output honest p-values as predictors enter a model, that properly account for the adaptive nature of the process. And all of this may be a result of the convexity of the l_1-penalized objective.

42.5 Conclusion

In this chapter I hope that I have conveyed my excitement for some recent developments in statistics, both in its theory and practice. I predict that convexity and sparsity will play an increasingly important role in the development of statistical methodology.


References

Bien, J., Taylor, J., and Tibshirani, R.J. (2013). A lasso for hierarchical interactions. The Annals of Statistics, 41:1111–1141.

Candès, E.J. (2006). Compressive sampling. In Proceedings of the International Congress of Mathematicians, Madrid, Spain. www.acm.caltech.edu/~emmanuel/papers/CompressiveSampling.pdf

Candès, E.J. and Tao, T. (2007). The Dantzig selector: Statistical estimation when p is much larger than n. The Annals of Statistics, 35:2313–2351.

Candès, E.J. and Tao, T. (2009). The power of convex relaxation: Near-optimal matrix completion. http://www.citebase.org/abstract?id=oai:arXiv.org:0903.1476

Donoho, D.L. (2004). Compressed Sensing. Technical Report, Statistics Department, Stanford University, Stanford, CA. www-stat.stanford.edu/~donoho/Reports/2004/CompressedSensing091604.pdf

Friedman, J.H., Hastie, T.J., and Tibshirani, R.J. (2007). Pathwise coordinate optimization. The Annals of Applied Statistics, 1:302–332.

Friedman, J.H., Hastie, T.J., and Tibshirani, R.J. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33:1–22.

Fu, W. (1998). Penalized regressions: The bridge versus the lasso. Journal of Computational and Graphical Statistics, 7:397–416.

Hastie, T.J., Tibshirani, R.J., and Friedman, J.H. (2001). The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer, New York.

Jolliffe, I.T., Trendafilov, N.T., and Uddin, M. (2003). A modified principal component technique based on the lasso. Journal of Computational and Graphical Statistics, 12:531–547.

Lockhart, R.A., Taylor, J., Tibshirani, R.J., and Tibshirani, R.J. (2014). A significance test for the lasso (with discussion). The Annals of Statistics, in press.

Mazumder, R., Hastie, T.J., and Tibshirani, R.J. (2010). Spectral regularization algorithms for learning large incomplete matrices. Journal of Machine Learning Research, 11:2287–2322.

Meinshausen, N., Meier, L., and Bühlmann, P. (2009). P-values for high-dimensional regression. Journal of the American Statistical Association, 104:1671–1681.


Tibshirani, R.J. (2011). Regression shrinkage and selection via the lasso: A retrospective. Journal of the Royal Statistical Society, Series B, 73:273–282.

Tibshirani, R.J., Hoefling, H., and Tibshirani, R.J. (2011). Nearly-isotonic regression. Technometrics, 53:54–61.

Tibshirani, R.J., Saunders, M., Rosset, S., Zhu, J., and Knight, K. (2005). Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society, Series B, 67:91–108.

Tibshirani, R.J. and Taylor, J. (2011). The solution path of the generalized lasso. The Annals of Statistics, 39:1335–1371.

Wasserman, L.A. and Roeder, K. (2009). High-dimensional variable selection. The Annals of Statistics, 37:2178–2201.

Witten, D.M., Tibshirani, R.J., and Hastie, T.J. (2009). A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, 10:515–534.

Yuan, M. and Lin, Y. (2007a). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B, 68:49–67.

Yuan, M. and Lin, Y. (2007b). Model selection and estimation in the Gaussian graphical model. Biometrika, 94:19–35.

Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101:1418–1429.

Zou, H. and Hastie, T.J. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 67:301–320.


43
Features of Big Data and sparsest solution in high confidence set

Jianqing Fan
Department of Operations Research and Financial Engineering
Princeton University, Princeton, NJ

This chapter summarizes some of the unique features of Big Data analysis. These features are shared neither by low-dimensional data nor by small samples. Big Data pose new computational challenges and hold great promise for understanding population heterogeneity, as in personalized medicine or services. High dimensionality introduces spurious correlations, incidental endogeneity, noise accumulation, and measurement error. These features are very distinctive, and statistical procedures should be designed with them in mind. To illustrate, a method called the sparsest solution in high confidence set is introduced; it is generally applicable to high-dimensional statistical inference. This method, whose properties are briefly examined, is natural because the information about the parameters contained in the data is summarized by high-confidence sets, and the sparsest solution is a way to deal with the noise accumulation issue.

43.1 Introduction

The first decade of this century has seen an explosion of data collection in this age of information and technology. The technological revolution has made information acquisition easy and cheap through automated data collection processes. Massive data and high dimensionality characterize many contemporary statistical problems, from the biomedical sciences to engineering and the social sciences. For example, in disease classification using microarray or proteomics data, tens of thousands of expressions of molecules or proteins are potential predictors; in genome-wide association studies, hundreds of thousands of SNPs are potential covariates; in machine learning, tens of thousands of features are extracted from documents, images and other objects; in spatial-temporal


problems encountered in economics and the earth sciences, time series from hundreds or thousands of regions are collected. When interactions are considered, the dimensionality grows even more quickly. Other examples of massive data include high-resolution images, high-frequency financial data, fMRI data, e-commerce data, marketing data, warehouse data, and functional and longitudinal data, among others. For an overview, see Hastie et al. (2009) and Bühlmann and van de Geer (2011).

Salient features of Big Data include both large samples and high dimensionality. Furthermore, Big Data are often collected over different platforms or locations. This generates issues of heterogeneity, measurement error, and experimental variation. The impacts of dimensionality include computational cost, algorithmic stability, spurious correlations, incidental endogeneity, and noise accumulation, among others. The aim of this chapter is to introduce and explain some of these concepts and to offer the sparsest solution in a high-confidence set as a viable approach to high-dimensional statistical inference.

In response to these challenges, many new statistical tools have been developed. These include boosting algorithms (Freund and Schapire, 1997; Bickel et al., 2006), regularization methods (Tibshirani, 1996; Chen et al., 1998; Fan and Li, 2001; Candès and Tao, 2007; Fan and Lv, 2011; Negahban et al., 2012), and screening methods (Fan and Lv, 2008; Hall et al., 2009; Li et al., 2012). According to Bickel (2008), the main goals of high-dimensional inference are to construct as effective a method as possible to predict future observations, to gain insight into the relationship between features and response for scientific purposes, and, hopefully, to improve prediction.

As we enter the Big Data era, an additional goal, thanks to large sample sizes, is to understand heterogeneity. Big Data allow one to apprehend the statistical properties of small heterogeneous groups, which would be termed "outliers" when the sample size is moderate. They also allow us to extract important but weak signals in the presence of large individual variations.

43.2 Heterogeneity

Big Data enhance our ability to find commonalities in a population, even in the presence of large individual variations. An example is whether drinking a cup of wine reduces the health risks of certain diseases. Population structures can be buried in large statistical noise in the data. Nevertheless, large sample sizes enable statisticians to mine such hidden structures.

What also makes Big Data exciting is that they hold great promise for understanding population heterogeneity and making important discoveries, say about molecular mechanisms involved in diseases that are rare or that affect only small populations. An example of this kind is to answer the question why


chemotherapy is helpful for certain populations, while harmful or ineffective for others.

Big Data are often aggregated from different sites and different platforms. Experimental variations need to be accounted for before a full analysis. Big Data can be thought of as a mixture of data arising from many heterogeneous populations. Let k be the number of heterogeneous groups, X be a collection of high-dimensional covariates, and y be a response. It is reasonable to regard Big Data as random realizations from a mixture of densities, viz.

\[ p_1 f_1(y;\theta_1(x)) + \cdots + p_k f_k(y;\theta_k(x)), \]

in which f_j(y; θ_j(x)) is the conditional density of Y given X = x in population j ∈ {1,...,k}, and the function θ_j(x) characterizes the dependence of the distribution on the covariates. Gaussian mixture models are a typical example; see, e.g., Khalili and Chen (2007) or Städler et al. (2010).

When the sample size is moderate, data from small groups with small p_j rarely occur. Should such data be sampled, they are usually regarded as statistical outliers or are buried in the larger groups. There are insufficient amounts of data to infer about θ_j(x). Thanks to Big Data, when n is so large that np_j is also large, there are sufficient amounts of data to infer about the commonality θ_j(x) in such a rare subpopulation. In this fashion, Big Data enable us to discover molecular mechanisms or genetic associations in small subpopulations, opening the door to personalized treatments. This holds true also in consumer services, where different subgroups demand different specialized services.

The above discussion further suggests that Big Data are of paramount importance in understanding population heterogeneity, a goal that would be illusory when the sample size is only moderately large. Big Data provide a way in which heterogeneous subpopulations can be distinguished and personalized treatments can be derived. They are also an important tool for the discovery of weak population structures hidden in large individual variations.
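As a toy illustration of the mixture viewpoint (my own sketch, not part of the chapter), the code below simulates a dominant population with a rare 5% subgroup and fits a two-component Gaussian mixture with scikit-learn. Covariates are omitted, so the θ_j here are simply means and variances; the specific sample size and group parameters are illustrative.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# A rare subpopulation (5%) hidden in a dominant one (95%): with n large,
# n * p_j is still large enough to estimate the small group's parameters.
n = 100_000
z = rng.random(n) < 0.05
y = np.where(z, rng.normal(3.0, 0.5, n), rng.normal(0.0, 1.0, n))

gmm = GaussianMixture(n_components=2, random_state=0).fit(y.reshape(-1, 1))
print("estimated weights:", gmm.weights_)   # roughly (0.95, 0.05)
print("estimated means:  ", gmm.means_.ravel())
```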


43.3 Computation

Large-scale computation plays a vital role in the analysis of Big Data. High-dimensional optimization is not only expensive but also unstable in computation, in addition to being slow to converge. Algorithms that involve iterative inversions of large matrices are infeasible due to instability and computational cost. Scalable and stable implementations of high-dimensional statistical procedures must be sought. This relies heavily on statistical intuition, large-scale screening and small-scale optimization. An example is given in Fan et al. (2009).

Large numbers of observations, which can be of the order of tens of thousands or even millions as in genomics, neuro-informatics, marketing, and online learning studies, also give rise to intensive computation. When the sample size is large, the computation of summary statistics such as correlations among all variables is expensive. Yet statistical methods often involve repeated evaluations of such functions. Parallel computing and other updating techniques are required. Therefore, the scalability of techniques to both the dimensionality and the number of cases should be borne in mind when developing statistical procedures.

43.4 Spurious correlation

Spurious correlation is a feature of high dimensionality. It refers to variables that are not correlated theoretically but whose sample correlation is high. To illustrate the concept, consider a random sample of size n = 50 of p independent standard N(0,1) random variables. The population correlation between any two of these random variables is zero, and their sample correlation should be small. This is indeed the case when the dimension is small in comparison with the sample size. When p is large, however, spurious correlations start to appear. To illustrate this point, let us compute

\[ \hat r = \max_{j \ge 2} \widehat{\mathrm{corr}}(Z_1, Z_j), \]

where ĉorr(Z_1, Z_j) is the sample correlation between variables Z_1 and Z_j. Similarly, we can compute

\[ \hat R = \max_{|S|=5} \widehat{\mathrm{corr}}(Z_1, Z_S), \qquad (43.1) \]

which is the maximum multiple correlation between Z_1 and Z_S with 1 ∉ S, namely the correlation between Z_1 and its best linear predictor using Z_S. In the implementation, we use the forward selection algorithm to compute an approximation to R̂; this approximation is no larger than R̂ but avoids computing all (p choose 5) multiple R² values in (43.1). This experiment is repeated 200 times.

The empirical distributions of r̂ and R̂ are shown in Figure 43.1. The spurious correlation r̂ is centered around .45 for p = 1000 and .55 for p = 10,000. The corresponding values are .85 and .91 when the multiple correlation R̂ is used. Theoretical results on the order of the spurious correlation r̂ are given in Cai and Jiang (2012) and Fan et al. (2012), but the order of R̂ remains unknown.

The impact of spurious correlation includes false scientific discoveries and false statistical inferences. In terms of scientific discoveries, Z_1 and Z_Ŝ are practically indistinguishable when n = 50, given that their correlation is around .9 for a set Ŝ with |Ŝ| = 5. If Z_1 represents the expression of a gene that is responsible for a disease, we can discover five genes Ŝ that have a similar predictive power even though they are unrelated to the disease.
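The distribution of r̂ is straightforward to simulate. The following sketch is an illustration under the same settings, not code from the chapter; the multiple correlation R̂, which requires a forward-selection search, is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(2)
n, nrep = 50, 200

def max_abs_corr(p):
    """Maximum absolute sample correlation between Z1 and Z2,...,Zp
    for n independent N(0,1) observations (the statistic r-hat)."""
    Z = rng.standard_normal((n, p))
    Z -= Z.mean(axis=0)                      # center each column
    Z /= np.linalg.norm(Z, axis=0)           # scale to unit norm
    return np.max(np.abs(Z[:, 1:].T @ Z[:, 0]))   # correlations with Z1

for p in (1_000, 10_000):
    r = [max_abs_corr(p) for _ in range(nrep)]
    print(f"p = {p}: median of r-hat = {np.median(r):.2f}")
```

The medians come out near .45 and .55 for p = 1000 and p = 10,000, consistent with the middle panel of Figure 43.1.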


FIGURE 43.1
Illustration of spurious correlation. Left panel: a typical realization of Z_1 with its most spuriously correlated variable (p = 1000); middle and right panels: distributions of r̂ and R̂ (maximum absolute sample correlation and maximum absolute multiple correlation) for p = 1000 and p = 10,000. The sample size is n = 50.

Similarly, if the genes in Ŝ are truly responsible for a disease, we may end up wrongly pronouncing Z_1 as the gene that is responsible for the disease.

We now examine the impact of spurious correlation on statistical inference. Consider a linear model

\[ Y = X^\top\beta + \varepsilon, \qquad \sigma^2 = \mathrm{var}(\varepsilon). \]

The residual variance based on a selected set Ŝ of variables is

\[ \hat\sigma^2 = \frac{1}{n - |\hat S|}\, Y^\top(I_n - P_{\hat S})Y, \qquad P_{\hat S} = X_{\hat S}(X_{\hat S}^\top X_{\hat S})^{-1}X_{\hat S}^\top. \]

When the variables are not selected using the data and the model is unbiased, the degrees-of-freedom adjustment makes the residual variance estimator unbiased. The situation is completely different, however, when the variables are selected using the data. For example, when β = 0, one has Y = ε and all selected variables are spurious. If the number of selected variables |Ŝ| is much smaller than n, then

\[ \hat\sigma^2 = \frac{1}{n - |\hat S|}\,(1 - \gamma_n^2)\,\|\varepsilon\|^2 \approx (1 - \gamma_n^2)\,\sigma^2, \]

where γ_n² = ε^⊤ P_Ŝ ε / ‖ε‖². Therefore, σ² is underestimated by a factor of γ_n².

Suppose that we select only one spurious variable. This variable must then be the one most correlated with Y or, equivalently, with ε. Because the spurious correlation is high, the bias is large. The two left panels of Figure 43.2 depict the distributions of γ_n along with the associated estimates σ̂² for different choices of p. Clearly, the bias increases with the dimension p.

When multiple spurious variables are selected, the bias of the residual variance estimator becomes more pronounced, since the spurious correlation gets larger, as demonstrated in Figure 43.1. To illustrate this, consider the linear model Y = 2X_1 + .3X_2 + ε and use the stepwise selection method to recruit variables.


FIGURE 43.2
Distributions of spurious correlations. Left panels: distributions of γ_n for the null model when |Ŝ| = 1 and the associated estimates of σ² = 1 for various choices of p. Right panels: distributions of γ_n for the model Y = 2X_1 + .3X_2 + ε and the associated estimates of σ² = 1 for various choices of |Ŝ| but fixed p = 1000. The sample size is n = 50. Adapted from Fan et al. (2012).

Again, the spurious variables are selected mainly because of their spurious correlation with ε, the unobserved but realized vector of random noise. As shown in the two right panels of Figure 43.2, the spurious correlation is very large and σ̂² becomes notably more biased as |Ŝ| gets larger.

Underestimation of the residual variance leads statistical inference astray. Variables are declared statistically significant that are not so in reality, and this leads to faulty scientific conclusions.

43.5 Incidental endogeneity

High dimensionality also gives rise to incidental endogeneity. Scientists collect covariates that are potentially related to the response. As there are many covariates, some of those variables can be incidentally correlated with the residual noise. This can cause model selection inconsistency and incorrect


selection of genes or SNPs for understanding molecular mechanisms or genetic associations.

Let us illustrate this problem using the simple linear model. The idealized model for variable selection is that there is a small subset S_0 of variables that explains a large portion of the variation in the response Y, viz.

\[ Y = X^\top\beta_0 + \varepsilon, \qquad E(\varepsilon X) = 0, \qquad (43.2) \]

in which the true parameter vector β_0 has support S_0. The goal of variable selection is to find the set S_0 and estimate the regression coefficients β_0. To be more concrete, let us assume that the data generating process is Y = X_1 + X_2 + ε, so that S_0 = {1, 2}. As we do not know which variables are related to Y in the joint model, we collect as many covariates as possible that we deem potentially related to Y, in the hope of including all members of S_0. Some of those X_j are incidentally correlated with Y − X_1 − X_2, i.e., with ε. This makes model (43.2) invalid. The rise of incidental endogeneity is due to high dimensionality, which unintentionally makes the specification E(εX) = 0 invalid for some of the collected covariates. The more covariates are collected, the less likely this assumption is to hold.

Does incidental endogeneity arise in practice? Can the exogeneity assumption E(εX) = 0 be validated? After data collection, variable selection techniques such as the lasso (Tibshirani, 1996; Chen et al., 1998) and folded-concave penalized least squares (Fan and Li, 2001; Zou and Li, 2008) are frequently used before drawing conclusions. The model is rarely validated. Indeed, the residuals are computed based only on the small set of selected variables. Unlike with ordinary least squares, the exogeneity assumption in (43.2) cannot be validated empirically because most variables are not used to compute the residuals. We now illustrate this fact with an example.

Consider the gene expressions of 90 western Europeans from the international "HapMap" project (Thorisson et al., 2005); these data are available at ftp://ftp.sanger.ac.uk/pub/genevar/. The normalized gene expression data were generated with an Illumina Sentrix Human-6 Expression BeadChip (Stranger et al., 2007). We took the gene expression of CHRNA6, the cholinergic receptor, nicotinic, alpha 6, as the response variable, and the remaining 47,292 expression profiles as covariates. The left panel of Figure 43.3 presents the correlations between the response variable and its associated covariates. The lasso is then employed to find the genes that are associated with the response. It selects 23 genes. The residuals ε̂, based on those genes, are then computed. The right panel of Figure 43.3 displays the distribution of the sample correlations between the covariates and the residuals. Clearly, many of them are far from zero, which is an indication that the exogeneity assumption in (43.2) cannot be validated. That is, incidental endogeneity is likely present. What is the consequence of this endogeneity? Fan and Liao (2014) show that it causes model selection inconsistency.
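The diagnostic shown in the right panel of Figure 43.3 is simply the vector of sample correlations between the covariates and the post-selection residuals. The sketch below shows the computation on synthetic (exogenous) data rather than the HapMap data; the dimensions, the use of LassoCV, and the 0.2 reporting cutoff are my own illustrative choices, not part of the original analysis.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(3)
n, p = 90, 2000
X = rng.standard_normal((n, p))
y = X[:, 0] + X[:, 1] + rng.standard_normal(n)   # toy exogenous model

# Lasso fit followed by the correlation diagnostic of Figure 43.3 (right panel)
resid = y - LassoCV(cv=5).fit(X, y).predict(X)

Xc = (X - X.mean(0)) / X.std(0)                  # standardize covariates
rc = (resid - resid.mean()) / resid.std()        # standardize residuals
corr = Xc.T @ rc / n                             # sample correlations corr(X_j, resid)
print("fraction of |corr| above 0.2:", np.mean(np.abs(corr) > 0.2))
```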


FIGURE 43.3
Distributions of sample correlations. Left panel: distribution of the sample correlations ĉorr(X_j, Y), j = 1,...,47,292. Right panel: distribution of the sample correlations ĉorr(X_j, ε̂), in which ε̂ represents the residuals after the lasso fit.

How do we deal with endogeneity? Ideally, we would hope to select S_0 consistently under only the assumption that

\[ Y = X_{S_0}^\top\beta_{S_0,0} + \varepsilon, \qquad E(\varepsilon X_{S_0}) = 0, \]

but this assumption is too weak to recover the set S_0. A stronger assumption is

\[ Y = X_{S_0}^\top\beta_{S_0,0} + \varepsilon, \qquad E(\varepsilon \mid X_{S_0}) = 0. \qquad (43.3) \]

Fan and Liao (2014) use over-identification conditions such as

\[ E(\varepsilon X_{S_0}) = 0 \quad \text{and} \quad E(\varepsilon X_{S_0}^2) = 0 \qquad (43.4) \]

to distinguish endogenous from exogenous variables; these conditions are weaker than the condition in (43.3). They introduce the Focused Generalized Method of Moments (FGMM), which uses the over-identification conditions to select the set of variables S_0 consistently. The reader can refer to their paper for technical details. The left panel of Figure 43.4 shows the distribution of the correlations between the covariates and the residuals after the FGMM fit. Many of the correlations are still non-zero, but this is fine, as we assume only (43.4) and merely need to validate this assumption empirically. For this data set, FGMM selects five genes.


FIGURE 43.4
Left panel: distribution of the sample correlations ĉorr(X_j, ε̂), in which ε̂ represents the residuals after the FGMM fit. Right panel: distribution of the sample correlations ĉorr(X_j, ε̂) for only the five genes selected by FGMM.

Therefore, we need only validate the 10 empirical correlations specified by conditions (43.4). The empirical correlations between the residuals after the FGMM fit and the five selected covariates are zero, and their correlations with the squared covariates are small. The results are displayed in the right panel of Figure 43.4. Therefore, our model assumptions and model diagnostics are consistent.

43.6 Noise accumulation

When a method depends on the estimation of many parameters, the estimation errors can accumulate. For high-dimensional statistics, noise accumulation is more severe and can even dominate the underlying signal. Consider, for example, a linear classification rule which assigns the class label 1(x^⊤β > 0) to each new data point x. This rule can have high discrimination power when β is known. However, when an estimator β̂ is used instead, the classification rule can be as bad as a random guess due to the accumulation of errors in estimating the high-dimensional vector β.


As an illustration, we simulate n data points from each of the populations N(µ_0, I_p) and N(µ_1, I_p), in which p = 4500, µ_0 = 0, and µ_1 is a realization of a mixture of a point mass at 0 with probability .98 and the standard double exponential distribution with probability .02. Therefore, most components have no discriminative power, yet some components are very powerful for classification. Indeed, among the 2%, or 90, realizations from the double exponential distribution, several components are very large and many are small.

The distance-based classifier classifies x to class 1 when

\[ \|x - \mu_1\|^2 \le \|x - \mu_0\|^2, \quad \text{or equivalently} \quad \beta^\top(x - \mu) \ge 0, \]

where β = µ_1 − µ_0 and µ = (µ_0 + µ_1)/2. Letting Φ denote the cumulative distribution function of a standard Normal random variable, we find that the misclassification rate is Φ(−‖µ_1 − µ_0‖/2), which is effectively zero because, by the Law of Large Numbers,

\[ \|\mu_1 - \mu_0\| \approx \sqrt{4500 \times .02 \times 1} \approx 9.48. \]

However, when β is estimated by the difference of the sample means, the resulting classification rule behaves like a random guess due to the accumulation of noise.

To help the intuition, we drew n = 100 data points from each class and selected the best m features from the p-dimensional space according to the absolute values of the components of µ_1; this is an infeasible procedure, but it can be approximated well when m is small (Fan and Fan, 2008). We then projected the m-dimensional data onto their first two principal components. Figure 43.5 presents these projections for various values of m. Clearly, when m = 2, the two projections have high discriminative power. They still do when m = 100, as there is noise accumulation but signal accumulation too: there are about 90 non-vanishing signals, though some are very small, and, as noted above, their aggregate strength is approximately 9.48. When m = 500 or 4500, the two projections have no discriminative power at all due to noise accumulation. See also Hall et al. (2005) for a geometric representation of high-dimension, low-sample-size data for further intuition.
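A stripped-down version of this experiment (my own sketch, with illustrative sample sizes and feature counts rather than the exact settings behind Figure 43.5) measures the test error of the distance-based rule directly instead of plotting principal-component projections:

```python
import numpy as np

rng = np.random.default_rng(4)
p, n = 4500, 100
# Sparse mean difference: 2% double-exponential entries, 98% exact zeros
mu1 = np.where(rng.random(p) < 0.02, rng.laplace(size=p), 0.0)

def sample(size, mu):
    return mu + rng.standard_normal((size, p))

Xtr0, Xtr1 = sample(n, np.zeros(p)), sample(n, mu1)
Xte0, Xte1 = sample(1000, np.zeros(p)), sample(1000, mu1)

order = np.argsort(-np.abs(mu1))        # the "infeasible" oracle feature ranking
for m in (2, 100, 500, 4500):
    keep = order[:m]
    beta = Xtr1[:, keep].mean(0) - Xtr0[:, keep].mean(0)       # estimated direction
    mid = (Xtr1[:, keep].mean(0) + Xtr0[:, keep].mean(0)) / 2  # estimated midpoint
    err0 = np.mean((Xte0[:, keep] - mid) @ beta >= 0)          # class-0 errors
    err1 = np.mean((Xte1[:, keep] - mid) @ beta < 0)           # class-1 errors
    print(f"m = {m:4d}: test error = {(err0 + err1) / 2:.2f}")
```

The error is small for small m, where the selected features are mostly signal, and drifts toward 0.5 as m grows and estimation noise accumulates.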


43.7 Sparsest solution in high confidence set

To attenuate the noise accumulation issue, we frequently impose sparsity on the underlying parameter β_0. At the same time, the information on β_0 contained in the data comes through statistical modeling, and it is summarized by confidence sets for β_0 in R^p. Combining these two pieces of information, a general solution to high-dimensional statistics is naturally the sparsest solution in a high-confidence set.

FIGURE 43.5
Scatter plots of projections of the observed data (n = 100 from each class) onto the first two principal components of the m-dimensional selected feature space. Panels: (a) m = 2, (b) m = 100, (c) m = 500, (d) m = 4500.

43.7.1 A general setup

We now elaborate on the idea. Assume that the Big Data are collected in the form (X_1, Y_1),...,(X_n, Y_n), which can be regarded as a random sample from the population (X, Y). We wish to find an estimate of the sparse vector β_0 ∈ R^p that minimizes L(β) = E{L(X^⊤β, Y)}, in which the loss function is assumed convex in its first argument so that L(β) is convex. This setup encompasses the generalized linear models (McCullagh and Nelder, 1989) with L(θ, y) = b(θ) − θy under the canonical link, where b(θ) is a model-dependent convex function; robust regression with L(θ, y) = |y − θ|; the hinge loss L(θ, y) = (1 − θy)_+ of the support vector machine (Vapnik, 1999); and the exponential loss L(θ, y) = exp(−θy) of AdaBoost (Freund and Schapire, 1997; Breiman, 1998) for classification, in which y takes values ±1; among others. Let

\[ L_n(\beta) = \frac{1}{n}\sum_{i=1}^n L(X_i^\top\beta, Y_i) \]

be the empirical loss and L′_n(β) its gradient. Given that L′(β_0) = 0, a natural confidence set is of the form

\[ \mathcal{C}_n = \{\beta \in \mathbb{R}^p : \|L'_n(\beta)\|_\infty \le \gamma_n\} \]


for some given γ_n that is related to the confidence level. Here L′_n(β) = 0 can be regarded as the estimating equations. Sometimes it is handy to construct the confidence set directly from the estimating equations.

In principle, any norm can be used to construct the confidence set. However, we take the L_∞-norm as it is the conjugate of the L_1-norm in Hölder's inequality. It also makes the set C_n convex, because |L′_n(β)| is nondecreasing in each argument. The tuning parameter γ_n is chosen so that the set C_n has confidence level 1 − δ_n, viz.

\[ \Pr(\beta_0 \in \mathcal{C}_n) = \Pr\{\|L'_n(\beta_0)\|_\infty \le \gamma_n\} \ge 1 - \delta_n. \qquad (43.5) \]

The confidence region C_n is called a high confidence set because δ_n → 0 and can even be zero. Note that the confidence set is the interface between the data and the parameters; it is applicable to all statistical problems, including those with measurement errors.

The set C_n summarizes the information in the data about β_0. If in addition we assume that β_0 is sparse, then a natural solution is the intersection of these two pieces of information, namely, the sparsest solution in the high-confidence region, viz.

\[ \min_{\beta \in \mathcal{C}_n} \|\beta\|_1 = \min_{\|L'_n(\beta)\|_\infty \le \gamma_n} \|\beta\|_1. \qquad (43.6) \]

This is a convex optimization problem. Here the sparsity is measured by the L_1-norm, but it can also be measured by other norms such as the weighted L_1-norm (Zou and Li, 2008). The idea is related to that in Negahban et al. (2012), where a nice framework for the analysis of high-dimensional M-estimators with decomposable regularizers is established for restricted convex losses.

43.7.2 Examples

The Dantzig selector (Candès and Tao, 2007) is a specific case of problem (43.6) in which the loss is quadratic, L(x, y) = (x − y)², and δ_n = 0. This provides an alternative view of the Dantzig selector. If L(x, y) = ρ(|x − y|) for a convex function ρ, then the confidence set implied by the data is

\[ \mathcal{C}_n = \bigl\{\beta \in \mathbb{R}^p : \|X^\top\{\rho'(|Y - X\beta|)\circ\operatorname{sgn}(Y - X\beta)\}\|_\infty \le \gamma_n\bigr\} \]

and the sparsest solution in the high confidence set is now given by

\[ \min \|\beta\|_1, \quad \text{subject to } \|X^\top\{\rho'(|Y - X\beta|)\circ\operatorname{sgn}(Y - X\beta)\}\|_\infty \le \gamma_n. \]

In particular, when ρ(θ) = θ and ρ(θ) = θ²/2, these correspond to the L_1-loss and the L_2-loss (the Dantzig selector).
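For the quadratic loss, problem (43.6) is a linear program. The sketch below is a generic implementation under assumed settings, not code from the chapter: the tuning value γ_n and the 0.1 reporting threshold are illustrative. It solves the Dantzig selector by writing β = u − v with u, v ≥ 0.

```python
import numpy as np
from scipy.optimize import linprog

def dantzig_selector(X, y, gamma):
    """Sparsest solution in the high-confidence set
       {beta : ||X^T (y - X beta)||_inf <= gamma},
    solved as a linear program with beta = u - v, u, v >= 0."""
    n, p = X.shape
    A = X.T @ X
    b = X.T @ y
    c = np.ones(2 * p)                        # sum(u) + sum(v) = ||beta||_1
    A_ub = np.vstack([np.hstack([A, -A]),     #  A(u - v) <= gamma + b
                      np.hstack([-A, A])])    # -A(u - v) <= gamma - b
    b_ub = np.concatenate([gamma + b, gamma - b])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method="highs")
    u, v = res.x[:p], res.x[p:]
    return u - v

rng = np.random.default_rng(5)
n, p = 100, 200
X = rng.standard_normal((n, p))
beta0 = np.zeros(p); beta0[:3] = (3.0, -2.0, 1.5)
y = X @ beta0 + rng.standard_normal(n)
gamma = np.sqrt(2 * n * np.log(p))            # illustrative tuning, of order sqrt(n log p)
beta_hat = dantzig_selector(X, y, gamma)
print("estimated support:", np.where(np.abs(beta_hat) > 0.1)[0])
```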


Similarly, in the construction of a sparse precision matrix Θ = Σ⁻¹ for the Gaussian graphical model, if L(Θ, S_n) = ‖ΘS_n − I_p‖²_F, where S_n is the sample covariance matrix and ‖·‖_F is the Frobenius norm, then the high confidence set provided by the data is

\[ \mathcal{C}_n = \{\Theta : \|S_n \circ (\Theta S_n - I_p)\|_\infty \le \gamma_n\}, \]

where ∘ denotes the componentwise product (a factor 2 for the off-diagonal elements is ignored). If we construct the high-confidence set directly from the estimating equations L′_n(Θ) = ΘS_n − I_p, then the sparsest solution in the high-confidence set becomes

\[ \min_{\|\Theta S_n - I_p\|_\infty \le \gamma_n} \|\mathrm{vec}(\Theta)\|_1. \]

If the matrix L_1-norm is used in (43.6) to measure the sparsity, then the resulting estimator is the CLIME estimator of Cai et al. (2011), viz.

\[ \min_{\|\Theta S_n - I_p\|_\infty \le \gamma_n} \|\Theta\|_1. \]

If we use the Gaussian log-likelihood, viz.

\[ L_n(\Theta) = -\ln(|\Theta|) + \mathrm{tr}(\Theta S_n), \]

then L′_n(Θ) = −Θ⁻¹ + S_n and C_n = {‖Θ⁻¹ − S_n‖_∞ ≤ γ_n}. The sparsest solution is then given by

\[ \min_{\|\Theta^{-1} - S_n\|_\infty \le \gamma_n} \|\Theta\|_1. \]

If the relative norm ‖A‖_∞ = ‖Θ^{1/2} A Θ^{1/2}‖_∞ is used, the solution can be written more symmetrically as

\[ \min_{\|\Theta^{1/2} S_n \Theta^{1/2} - I_p\|_\infty \le \gamma_n} \|\Theta\|_1. \]

In the construction of a sparse linear discriminant analysis for two Normal distributions N(µ_0, Σ) and N(µ_1, Σ), the Fisher classifier is linear and of the form 1{β^⊤(X − µ) > 0}, where µ = (µ_0 + µ_1)/2, δ = µ_1 − µ_0, and β = Σ⁻¹δ. The parameters µ and δ can easily be estimated from the sample. The question is how to estimate β, which is assumed to be sparse. One direct way to construct a confidence set is to base it on the estimating equations L′_n(β) = S_n β − δ̂, where S_n is the pooled sample covariance and δ̂ is the difference of the two sample means. The high-confidence set is then

\[ \mathcal{C}_n = \{\beta : \|S_n\beta - \hat\delta\|_\infty \le \gamma_n\}. \qquad (43.7) \]

Again, this is a set implied by the data with high confidence. The sparsest solution is the linear programming discriminant rule of Cai et al. (2011).

The above method of constructing the confidence set is neither unique nor the smallest. Observe that (through personal communication with Dr Emre Barut)

\[ \|S_n\beta - \hat\delta\|_\infty = \|(S_n - \Sigma)\beta + \delta - \hat\delta\|_\infty \le \|S_n - \Sigma\|_\infty\|\beta\|_1 + \|\delta - \hat\delta\|_\infty. \]

Therefore, a high confidence set can be taken as

\[ \mathcal{C}_n = \{\|S_n\beta - \hat\delta\|_\infty \le \gamma_{n,1}\|\beta\|_1 + \gamma_{n,2}\}, \qquad (43.8) \]

where γ_{n,1} and γ_{n,2} are high-confidence upper bounds for ‖S_n − Σ‖_∞ and ‖δ − δ̂‖_∞. The set (43.8) is smaller than the set (43.7), since further bounding ‖β‖_1 in (43.8) by a constant γ_{n,3} yields (43.7).


43.7.3 Properties

Let β̂ be a solution to (43.6) and ∆̂ = β̂ − β_0. As for the Dantzig selector, the feasibility of β_0 implied by (43.5) entails that

\[ \|\beta_0\|_1 \ge \|\hat\beta\|_1 = \|\beta_0 + \hat\Delta\|_1. \qquad (43.9) \]

Letting S_0 = supp(β_0), we have

\[ \|\beta_0 + \hat\Delta\|_1 = \|(\beta_0 + \hat\Delta)_{S_0}\|_1 + \|\hat\Delta_{S_0^c}\|_1 \ge \|\beta_0\|_1 - \|\hat\Delta_{S_0}\|_1 + \|\hat\Delta_{S_0^c}\|_1. \]

This together with (43.9) yields

\[ \|\hat\Delta_{S_0}\|_1 \ge \|\hat\Delta_{S_0^c}\|_1, \qquad (43.10) \]

i.e., ∆̂ is sparse or "restricted." In particular, with s = |S_0|,

\[ \|\hat\Delta\|_2 \ge \|\hat\Delta_{S_0}\|_2 \ge \|\hat\Delta_{S_0}\|_1/\sqrt{s} \ge \|\hat\Delta\|_1/(2\sqrt{s}), \qquad (43.11) \]

where the last inequality uses (43.10). At the same time, since β̂ and β_0 are both in the feasible set (43.5), we have

\[ \|L'_n(\hat\beta) - L'_n(\beta_0)\|_\infty \le 2\gamma_n \]

with probability at least 1 − δ_n. By Hölder's inequality, we conclude that

\[ \bigl|[L'_n(\hat\beta) - L'_n(\beta_0)]^\top\hat\Delta\bigr| \le 2\gamma_n\|\hat\Delta\|_1 \le 4\sqrt{s}\,\gamma_n\|\hat\Delta\|_2 \qquad (43.12) \]

with probability at least 1 − δ_n, where the last inequality uses (43.11). Using a Taylor expansion, we can prove the existence of a point β* on the line segment between β_0 and β̂ such that L′_n(β̂) − L′_n(β_0) = L″_n(β*)∆̂. Therefore,

\[ |\hat\Delta^\top L''_n(\beta^*)\hat\Delta| \le 4\sqrt{s}\,\gamma_n\|\hat\Delta\|_2. \]

Since C_n is a convex set, β* ∈ C_n. If we generalize the restricted eigenvalue condition to the generalized restricted eigenvalue condition, viz.

\[ \inf_{\beta\in\mathcal{C}_n}\ \inf_{\|\Delta_{S_0}\|_1 \ge \|\Delta_{S_0^c}\|_1} |\Delta^\top L''_n(\beta)\Delta|/\|\Delta\|_2^2 \ge a, \qquad (43.13) \]

then we have

\[ \|\hat\Delta\|_2 \le 4a^{-1}\sqrt{s}\,\gamma_n. \qquad (43.14) \]

The inequality (43.14) is a statement about the L_2-convergence of β̂, holding with probability at least 1 − δ_n. Note that each component of

\[ L'_n(\hat\beta) - L'_n(\beta_0) = L'_n(\beta_0 + \hat\Delta) - L'_n(\beta_0) \]

in (43.12) has the same sign as the corresponding component of ∆̂. Condition (43.13) can also be replaced by the requirement

\[ \inf_{\|\Delta_{S_0}\|_1 \ge \|\Delta_{S_0^c}\|_1} \bigl|[L'_n(\beta_0 + \Delta) - L'_n(\beta_0)]^\top\Delta\bigr| \ge a\|\Delta\|_2^2. \]

This facilitates the case where L″_n does not exist and is a specific case of Negahban et al. (2012).


43.8 Conclusion

Big Data arise from many frontiers of scientific research and technological development. They hold great promise for the discovery of heterogeneity and the search for personalized treatments. They also allow us to find weak patterns in the presence of large individual variations.

Salient features of Big Data include experimental variations, computational cost, noise accumulation, spurious correlations, incidental endogeneity, and measurement errors. These issues should be seriously considered in Big Data analysis and in the development of statistical procedures.

As an example, we offered here the sparsest solution in high-confidence sets as a generic solution to high-dimensional statistical inference, and we derived a useful mean-square error bound. This method combines naturally two pieces of useful information: the data and the sparsity assumption.

Acknowledgement

This project was supported by the National Institute of General Medical Sciences of the National Institutes of Health through Grants R01–GM072611 and R01–GM100474. Partial funding in support of this work was also provided by National Science Foundation grant DMS–1206464. The author would like to thank Ahmet Emre Barut, Yuan Liao, and Martin Wainwright for help and discussions related to the preparation of this chapter. The author is also grateful to Christian Genest for many helpful suggestions.

References

Bickel, P.J. (2008). Discussion on the paper "Sure independence screening for ultrahigh dimensional feature space" by Fan and Lv. Journal of the Royal Statistical Society, Series B, 70:883–884.

Bickel, P.J., Ritov, Y., and Zakai, A. (2006). Some theory for generalized boosting algorithms. The Journal of Machine Learning Research, 7:705–732.

Breiman, L. (1998). Arcing classifiers. The Annals of Statistics, 26:801–849.


Bühlmann, P. and van de Geer, S. (2011). Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer, Berlin.

Cai, T. and Jiang, T. (2012). Phase transition in limiting distributions of coherence of high-dimensional random matrices. Journal of Multivariate Analysis, 107:24–39.

Cai, T., Liu, W., and Luo, X. (2011). A constrained ℓ₁ minimization approach to sparse precision matrix estimation. Journal of the American Statistical Association, 106:594–607.

Candès, E.J. and Tao, T. (2007). The Dantzig selector: Statistical estimation when p is much larger than n. The Annals of Statistics, 35:2313–2351.

Chen, S.S., Donoho, D.L., and Saunders, M.A. (1998). Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20:33–61.

Fan, J. and Fan, Y. (2008). High-dimensional classification using features annealed independence rules. The Annals of Statistics, 36:2605.

Fan, J., Guo, S., and Hao, N. (2012). Variance estimation using refitted cross-validation in ultrahigh dimensional regression. Journal of the Royal Statistical Society, Series B, 74:37–65.

Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96:1348–1360.

Fan, J. and Liao, Y. (2014). Endogeneity in ultrahigh dimension. Journal of the American Statistical Association, to appear.

Fan, J. and Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space (with discussion). Journal of the Royal Statistical Society, Series B, 70:849–911.

Fan, J. and Lv, J. (2011). Nonconcave penalized likelihood with NP-dimensionality. IEEE Transactions on Information Theory, 57:5467–5484.

Fan, J., Samworth, R., and Wu, Y. (2009). Ultrahigh dimensional feature selection: Beyond the linear model. The Journal of Machine Learning Research, 10:2013–2038.

Freund, Y. and Schapire, R.E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55:119–139.

Hall, P., Marron, J.S., and Neeman, A. (2005). Geometric representation of high dimension, low sample size data. Journal of the Royal Statistical Society, Series B, 67:427–444.


Hall, P., Titterington, D.M., and Xue, J.-H. (2009). Tilting methods for assessing the influence of components in a classifier. Journal of the Royal Statistical Society, Series B, 71:783–803.

Hastie, T., Tibshirani, R.J., and Friedman, J. (2009). The Elements of Statistical Learning. Springer, New York.

Khalili, A. and Chen, J. (2007). Variable selection in finite mixture of regression models. Journal of the American Statistical Association, 102:1025–1038.

Li, R., Zhong, W., and Zhu, L. (2012). Feature screening via distance correlation learning. Journal of the American Statistical Association, 107:1129–1139.

McCullagh, P. and Nelder, J.A. (1989). Generalized Linear Models. Chapman & Hall, London.

Negahban, S.N., Ravikumar, P., Wainwright, M.J., and Yu, B. (2012). A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statistical Science, 27:538–557.

Städler, N., Bühlmann, P., and van de Geer, S. (2010). ℓ₁-penalization for mixture regression models (with discussion). Test, 19:209–256.

Stranger, B.E., Nica, A.C., Forrest, M.S., Dimas, A., Bird, C.P., Beazley, C., Ingle, C.E., Dunning, M., Flicek, P., Koller, D., Montgomery, S., Tavaré, S., Deloukas, P., and Dermitzakis, E.T. (2007). Population genomics of human gene expression. Nature Genetics, 39:1217–1224.

Thorisson, G.A., Smith, A.V., Krishnan, L., and Stein, L.D. (2005). The International HapMap Project Web Site. Genome Research, 15:1592–1593.

Tibshirani, R.J. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58:267–288.

Vapnik, V. (1999). The Nature of Statistical Learning Theory. Springer, Berlin.

Zou, H. and Li, R. (2008). One-step sparse estimates in nonconcave penalized likelihood models. The Annals of Statistics, 36:1509–1533.


44
Rise of the machines

Larry A. Wasserman
Department of Statistics
Carnegie Mellon University, Pittsburgh, PA

On the 50th anniversary of the COPSS, I reflect on the rise of the field of machine learning and what it means for statistics. Machine learning offers a plethora of new research areas, new application areas, and new colleagues to work with. Our students now compete with those in machine learning for jobs. I am optimistic that visionary statistics departments will embrace this emerging field; those that ignore or eschew machine learning do so at their own risk and may find themselves in the rubble of an outdated, antiquated field.

44.1 Introduction

Statistics is the science of learning from data. Machine learning (ML) is the science of learning from data. These fields are identical in intent, although they differ in their history, conventions, emphasis and culture.

There is no denying the success and importance of the field of statistics for science and, more generally, for society. I'm proud to be a part of the field. The focus of this essay is on one challenge (and opportunity) to our field: the rise of machine learning.

During my twenty-five year career I have seen machine learning evolve from a collection of rather primitive (yet clever) methods for classification into a sophisticated science that is rich in theory and applications.

A quick glance at the Journal of Machine Learning Research (jmlr.csail.mit.edu) and NIPS (books.nips.cc) reveals papers on a variety of topics that will be familiar to statisticians, such as conditional likelihood, sequential design, reproducing kernel Hilbert spaces, clustering, bioinformatics, minimax theory, sparse regression, estimating large covariance matrices, model selection, density estimation, graphical models, wavelets, and nonparametric


regression. These could just as well be papers in our flagship statistics journals.

This sampling of topics should make it clear that researchers in machine learning — who were at one time somewhat unaware of mainstream statistical methods and theory — are now not only aware of, but actively engaged in, cutting-edge research on these topics.

On the other hand, there are statistical topics that are active areas of research in machine learning but are virtually ignored in statistics. To avoid becoming irrelevant, we statisticians need to (i) stay current on research areas in ML, (ii) change our outdated model for disseminating knowledge, and (iii) revamp our graduate programs.

44.2 The conference culture

ML moves at a much faster pace than statistics. At first, ML researchers developed expert systems that eschewed probability. But very quickly they adopted advanced statistical concepts like empirical process theory and concentration of measure. This transition happened in a matter of a few years. Part of the reason for this fast pace is the conference culture. The main venue for research in ML is refereed conference proceedings rather than journals.

Graduate students produce a stream of research papers and graduate with hefty CVs. One of the reasons for the blistering pace is, again, the conference culture.

The process of writing a typical statistics paper goes like this: you have an idea for a method, you stew over it, you develop it, you prove some results about it, and eventually you write it up and submit it. Then the refereeing process starts. One paper can take years.

In ML, the intellectual currency is conference publications. There are a number of deadlines for the main conferences (NIPS, AISTATS, ICML, COLT). The threat of a deadline forces one to quit ruminating and start writing. Most importantly, all faculty members and students are facing the same deadline, so there is a synergy in the field that has mutual benefits. No one minds if you cancel a class right before the NIPS deadline. And then, after the deadline, everyone faces another deadline: refereeing each other's papers, and doing so in a timely manner. If you have an idea and don't submit a paper on it, you may be out of luck because someone may scoop you.

This pressure is good; it keeps the field moving at a fast pace. If you think this leads to poorly written papers or poorly thought out ideas, I suggest you look at nips.cc and read some of the papers. There are some substantial, deep papers. There are also a few bad papers. Just like in our journals. The papers are refereed, and the acceptance rate is comparable to that of our main journals. And


if an idea requires more detailed follow-up, then one can always write a longer journal version of the paper.

Absent this stream of constant deadlines, a field moves slowly. This is a problem for statistics, not only for its own sake but also because it now competes with ML.

Of course, there are disadvantages to the conference culture. Work is done in a rush, and ideas are often not fleshed out in detail. But I think that the advantages outweigh the disadvantages.

44.3 Neglected research areas

There are many statistical topics that are dominated by ML and mostly ignored by statistics. This is a shame because statistics has much to offer in all these areas. Examples include semi-supervised inference, computational topology, online learning, sequential game theory, hashing, active learning, deep learning, differential privacy, random projections and reproducing kernel Hilbert spaces. Ironically, some of these — like sequential game theory and reproducing kernel Hilbert spaces — started in statistics.

44.4 Case studies

I'm lucky. I am at an institution which has a Machine Learning Department (within the School of Computer Science) and, more importantly, the ML department welcomes involvement by statisticians. So I've been fortunate to work with colleagues in ML, attend their seminars, work with ML students and teach courses in the ML department.

There are a number of topics I've worked on at least partly due to my association with ML. These include statistical topology, graphical models, semi-supervised inference, conformal prediction, and differential privacy.

Since this paper is supposed to be a personal reflection, let me now briefly discuss two of these ML problems that I have had the good fortune to work on. The point of these examples is to show how statistical thinking can be useful for machine learning.

44.4.1 Case study I: Semi-supervised inference

Suppose we observe data (X_1, Y_1),...,(X_n, Y_n) and we want to predict Y from X. If Y is discrete, this is a classification problem. If Y is real-valued, this is a regression problem. Further, suppose we observe more data


X_{n+1},...,X_N without the corresponding Y values. We thus have labeled data L = {(X_1, Y_1),...,(X_n, Y_n)} and unlabeled data U = {X_{n+1},...,X_N}. How do we use the unlabeled data in addition to the labeled data to improve prediction? This is the problem of semi-supervised inference.

FIGURE 44.1
Labeled data.

Consider Figure 44.1. The covariate is x = (x_1, x_2) ∈ R². The outcome in this case is binary, as indicated by the circles and squares. Finding the decision boundary using only the labeled data is difficult. Figure 44.2 shows the labeled data together with some unlabeled data. We clearly see two clusters. If we make the additional assumption that Pr(Y = 1 | X = x) is smooth relative to the clusters, then we can use the unlabeled data to nail down the decision boundary accurately.

There are copious papers with heuristic methods for taking advantage of unlabeled data. To see how useful these methods might be, consider the following example. We download one million webpages with images of cats and dogs. We randomly select 100 pages and classify them by hand. Semi-supervised methods allow us to use the other 999,900 webpages to construct a good classifier.

But does semi-supervised inference work? Or, to put it another way, under what conditions does it work? In Azizyan et al. (2013), we showed the following (which I state informally here).

Suppose that X_i ∈ R^d. Let S_n denote the set of supervised estimators; these estimators use only the labeled data. Let SS_N denote the set of semi-supervised estimators; these estimators use the labeled data and the unlabeled data. Let m be the number of unlabeled data points and suppose that m ≥ n^{2/(2+ξ)} for some 0 < ξ < d − 3.


FIGURE 44.2
Labeled and unlabeled data.

1. There is a semi-supervised estimator f̂ such that

\[ \sup_{P\in\mathcal{P}_n} R_P(\hat f) \le \left(\frac{C}{n}\right)^{\frac{2}{2+\xi}}, \qquad (44.1) \]

where R_P(f̂) = E{f̂(X) − f(X)}² is the risk of the estimator f̂ under distribution P.

2. For supervised estimators S_n, we have

\[ \inf_{\hat f\in\mathcal{S}_n}\ \sup_{P\in\mathcal{P}_n} R_P(\hat f) \ge \left(\frac{C}{n}\right)^{\frac{2}{d-1}}. \qquad (44.2) \]

3. Combining these two results, we conclude that

\[ \frac{\inf_{\hat f\in\mathcal{SS}_N}\,\sup_{P\in\mathcal{P}_n} R_P(\hat f)}{\inf_{\hat f\in\mathcal{S}_n}\,\sup_{P\in\mathcal{P}_n} R_P(\hat f)} \le \left(\frac{C}{n}\right)^{\frac{2(d-3-\xi)}{(2+\xi)(d-1)}} \longrightarrow 0, \qquad (44.3) \]

and hence semi-supervised estimation dominates supervised estimation.

The class P_n consists of distributions such that the marginal for X is highly concentrated near some lower-dimensional set and such that the regression function is smooth on this set. We have not proved that the class must be of this form for semi-supervised inference to improve on supervised inference, but we suspect that is indeed the case. Our framework includes a parameter α that characterizes the strength of the semi-supervised assumption. We showed that, in fact, one can use the data to adapt to the correct value of α.
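The cluster assumption behind this result can be illustrated with a deliberately simple "cluster-then-label" rule — a toy sketch of my own, not the estimator analyzed in Azizyan et al. (2013): cluster the pooled covariates, then label each cluster by majority vote among its few labeled points.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(6)

def make_data(n):
    """Two well-separated clusters; cluster membership determines the label."""
    z = rng.random(n) < 0.5
    X = np.where(z[:, None], (2.5, 2.5), (-2.5, -2.5)) + rng.standard_normal((n, 2))
    return X, z.astype(int)

X_lab, y_lab = make_data(20)        # a handful of labeled points
X_unlab, _ = make_data(2000)        # plenty of unlabeled points

# Cluster labeled + unlabeled covariates, then label each cluster by
# majority vote among the labeled points that fall into it.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(np.vstack([X_lab, X_unlab]))
lab_clusters = km.predict(X_lab)
cluster_label = {c: int(np.round(y_lab[lab_clusters == c].mean())) for c in (0, 1)}

X_test, y_test = make_data(1000)
y_hat = np.array([cluster_label[c] for c in km.predict(X_test)])
print("test accuracy:", np.mean(y_hat == y_test))
```

With only 20 labeled points, the unlabeled data pin down the cluster boundary, which is exactly the situation depicted in Figures 44.1 and 44.2.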


44.4.2 Case study II: Statistical topology

Computational topologists and researchers in machine learning have developed methods for analyzing the shape of functions and data. Here I'll briefly review some of our work on estimating manifolds (Genovese et al., 2012a,b,c).

Suppose that M is a manifold of dimension d embedded in R^D. Let X_1,...,X_n be a sample from a distribution in P supported on M. We observe

\[ Y_i = X_i + \epsilon_i, \qquad i \in \{1,\ldots,n\}, \qquad (44.4) \]

where ε_1,...,ε_n ∼ Φ are noise variables.

Machine learning researchers have derived many methods for estimating the manifold M. But this leaves open an important statistical question: how well do these estimators work? One approach to answering this question is to find the minimax risk under some loss function. Let M̂ be an estimator of M. A natural loss function for this problem is the Hausdorff loss:

\[ H(M, \hat M) = \inf\bigl\{\epsilon : M \subset \hat M \oplus \epsilon \ \text{and}\ \hat M \subset M \oplus \epsilon\bigr\}. \qquad (44.5) \]

Let P be a set of distributions. The parameter of interest is M = support(P), which we assume is a d-dimensional manifold. The minimax risk is

\[ R_n = \inf_{\hat M}\ \sup_{P\in\mathcal{P}} E_P[H(\hat M, M)]. \qquad (44.6) \]

Of course, the risk depends on what conditions we assume on M and on the noise Φ.

Our main findings are as follows. When there is no noise — so the data fall on the manifold — we get R_n ≍ n^{−2/d}. When the noise is perpendicular to M, the risk is R_n ≍ n^{−2/(2+d)}. When the noise is Gaussian, the rate is R_n ≍ 1/log n. The latter is not surprising when one considers the similar problem of estimating a function when there are errors in variables.

The implication for machine learning is that the best these algorithms can do depends heavily on the particulars of the type of noise.

How do we actually estimate these manifolds in practice? In Genovese et al. (2012c) we take the following point of view: if the noise is not too large, then the manifold should be close to a d-dimensional hyper-ridge of the density p(y) for Y. Ridge finding is an extension of mode finding, which is a common task in computer vision.

Let p be a density on R^D. Suppose that p has k modes m_1,...,m_k. An integral curve, or path of steepest ascent, is a path π : R → R^D such that

\[ \pi'(t) = \frac{d}{dt}\pi(t) = \nabla p\{\pi(t)\}. \qquad (44.7) \]

Under weak conditions, the paths π partition the space and are disjoint except at the modes (Irwin, 1980; Chacón, 2012).


FIGURE 44.3
The mean shift algorithm. The data points move along trajectories during the iterations until they reach the two modes, marked by the two large asterisks.

The mean shift algorithm (Fukunaga and Hostetler, 1975; Comaniciu and Meer, 2002) is a method for finding the modes of a density by following the steepest ascent paths. The algorithm starts with a mesh of points and then moves the points along gradient ascent trajectories towards local maxima. A simple example is shown in Figure 44.3.
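A bare-bones version of the (unconstrained) mean shift algorithm is sketched below; it is a generic illustration with an arbitrary Gaussian kernel bandwidth, not the subspace constrained variant discussed next, which additionally projects each shift onto selected eigenvectors of the local Hessian.

```python
import numpy as np

def mean_shift(X, grid, bandwidth=0.6, steps=200):
    """Move each grid point uphill on the kernel density estimate of X
    using the mean-shift update (a gradient-ascent step with adaptive size)."""
    pts = grid.copy()
    for _ in range(steps):
        # Gaussian kernel weights between every grid point and every data point
        d2 = ((pts[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        w = np.exp(-d2 / (2 * bandwidth ** 2))
        pts = (w @ X) / w.sum(1, keepdims=True)   # weighted average = shifted point
    return pts

rng = np.random.default_rng(7)
# Two-mode density: a balanced mixture of two Gaussians in R^2
X = np.vstack([rng.normal((-2.0, 0.0), 0.7, (150, 2)),
               rng.normal((+2.0, 0.0), 0.7, (150, 2))])
grid = rng.uniform(-4, 4, (50, 2))
modes = mean_shift(X, grid)
print(np.unique(np.round(modes, 1), axis=0))   # grid points collapse near the two modes
```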


Given a function p : R^D → R, let g(x) = ∇p(x) denote the gradient at x and let H(x) denote the Hessian matrix. Let

\[ \lambda_1(x) \ge \cdots \ge \lambda_D(x) \qquad (44.8) \]

denote the eigenvalues of H(x), and let Λ(x) be the diagonal matrix whose diagonal elements are these eigenvalues. Write the spectral decomposition of H(x) as H(x) = U(x)Λ(x)U(x)^⊤. Fix 0 ≤ d < D and let V(x) denote the matrix whose columns are the eigenvectors corresponding to the D − d smallest eigenvalues. Define the projected gradient

\[ G(x) = L(x)\,g(x), \qquad L(x) = V(x)V(x)^\top. \qquad (44.9) \]

The projected gradient defines a flow; the intuition is that the flow passing through x is a gradient ascent path moving towards higher values of p. Unlike the paths defined by the gradient g, which move towards modes, the paths defined by G move towards ridges.

The paths can be parameterized in many ways. One commonly used parameterization uses t ∈ [−∞, ∞], where large values of t correspond to higher values of p. In this case t = ∞ corresponds to a point on the ridge. In this parameterization we can express each integral curve in the flow as follows. A map π : R → R^D is an integral curve with respect to the flow of G if

\[ \pi'(t) = G\{\pi(t)\} = L\{\pi(t)\}\,g\{\pi(t)\}. \qquad (44.10) \]

Definition. The ridge R consists of the destinations of the integral curves: y ∈ R if lim_{t→∞} π(t) = y for some π satisfying (44.10).

As mentioned above, the integral curves partition the space, and for each x ∉ R there is a unique path π_x passing through x. The ridge points are zeros of the projected gradient: y ∈ R implies that G(y) = (0,...,0)^⊤. Ozertem and Erdogmus (2011) derived an extension of the mean-shift algorithm, called the subspace constrained mean shift algorithm, that finds ridges; it can be applied to the kernel density estimator. Our results can be summarized as follows:

1. Stability. We showed that if two functions are sufficiently close together, then their ridges are also close together (in Hausdorff distance).

2. We constructed an estimator R̂ such that

\[ H(R, \hat R) = O_P\!\left(\left(\frac{\log n}{n}\right)^{\frac{2}{D+8}}\right), \qquad (44.11) \]

where H is the Hausdorff distance. Further, we showed that R̂ is topologically similar to R. We also constructed an estimator R̂_h, for h > 0, that satisfies

\[ H(R_h, \hat R_h) = O_P\!\left(\left(\frac{\log n}{n}\right)^{\frac{1}{2}}\right), \qquad (44.12) \]

where R_h is a smoothed version of R.

3. Suppose the data are obtained by sampling points on a manifold and adding noise with small variance σ². We showed that the resulting density p has a ridge R_σ such that

\[ H(M, R_\sigma) = O\bigl(\sigma^2\log^3(1/\sigma)\bigr) \qquad (44.13) \]

and R_σ is topologically similar to M. Hence, when the noise σ is small, the ridge is close to M. It then follows that

\[ H(M, \hat R) = O_P\!\left(\left(\frac{\log n}{n}\right)^{\frac{2}{D+8}}\right) + O\bigl(\sigma^2\log^3(1/\sigma)\bigr). \qquad (44.14) \]


FIGURE 44.4
Simulated cosmic web data.

An example can be found in Figures 44.4 and 44.5. I believe that statistics has much to offer to this area, especially in terms of making the assumptions precise and clarifying how accurate the inferences can be.

44.5 Computational thinking

There is another interesting difference that is worth pondering. Consider the problem of estimating a mixture of Gaussians. In statistics we think of this as a solved problem. You use, for example, maximum likelihood, which is implemented by the EM algorithm. But the EM algorithm does not solve the problem. There is no guarantee that the EM algorithm will actually find the MLE; it's a shot in the dark. The same comment applies to MCMC methods.

In ML, when you say you've solved the problem, you mean that there is a polynomial time algorithm with provable guarantees. There is, in fact, a rich literature in ML on estimating mixtures that does provide polynomial time algorithms. Furthermore, they come with theorems telling you how many observations you need if you want the estimator to be within a certain distance of the truth, with probability at least 1 − δ. This is typical of what is expected of an estimator in ML. You need to provide a provable polynomial time algorithm and a finite sample (non-asymptotic) guarantee on the estimator.

ML puts heavier emphasis on computational thinking. Consider, for example, the difference between P and NP-hard problems. This is at the heart of theoretical computer science and ML.


FIGURE 44.5
Ridge finder applied to simulated cosmic web data.

Running an MCMC on an NP-hard problem might be meaningless. Instead, it may be better to approximate the NP-hard problem with a simpler problem. How often do we teach this to our students?

44.6 The evolving meaning of data

For most of us in statistics, data means numbers. But data now include images, documents, videos, web pages, twitter feeds and so on. Traditional data — numbers from experiments and observational studies — are still of vital importance, but they represent a tiny fraction of the data out there. If we take the union of all the data in the world, what fraction is being analyzed by statisticians? I think it is a small number.

This comes back to education. If our students can't analyze giant datasets like millions of twitter feeds or millions of web pages, then other people will analyze those data. We will end up with a small cut of the pie.


44.7 Education and hiring

The goal of a graduate student in statistics is to find an advisor and write a thesis. They graduate with a single data point: their thesis work.

The goal of a graduate student in ML is to find a dozen different research problems to work on and publish many papers. They graduate with a rich data set: many papers on many topics with many different people.

Having been on hiring committees for both statistics and ML, I can say that the difference is striking. It is easier to choose candidates to interview in ML. You have a lot of data on each candidate and you know what you are getting. In statistics, it is a struggle. You have little more than a few papers that bear their advisor's footprint.

The ML conference culture encourages publishing many papers on many topics, which is better for both the students and their potential employers. And now statistics students are competing with ML students, which puts statistics students at a significant disadvantage.

There are a number of topics that are routinely covered in ML that we rarely teach in statistics. Examples are: Vapnik–Chervonenkis theory, concentration of measure, random matrices, convex optimization, graphical models, reproducing kernel Hilbert spaces, support vector machines, and sequential game theory. It is time to get rid of antiques like UMVUE, complete statistics and so on, and teach modern ideas.

44.8 If you can't beat them, join them

I don't want to leave the reader with the impression that we are in some sort of competition with ML. Instead, we should feel blessed that a second group of statisticians has appeared. Working with ML and adopting some of their ideas enriches both fields.

ML has much to offer statistics. And statisticians have a lot to offer ML. For example, we put much emphasis on quantifying uncertainty (standard errors, confidence intervals, posterior distributions), an emphasis that is perhaps lacking in ML. And sometimes statistical thinking casts new light on existing ML methods. A good example is the statistical view of boosting given in Friedman et al. (2000). I hope we will see collaboration and cooperation between the two fields thrive in the years to come.


Acknowledgements

I'd like to thank Aaditya Ramdas, Kathryn Roeder, Rob Tibshirani, Ryan Tibshirani, Isa Verdinelli, a referee and readers of my blog for reading a draft of this essay and providing helpful suggestions.

References

Azizyan, M., Singh, A., and Wasserman, L.A. (2013). Density-sensitive semisupervised inference. The Annals of Statistics, 41:751–771.

Chacón, J.E. (2012). Clusters and water flows: A novel approach to modal clustering through Morse theory. arXiv preprint arXiv:1212.1384.

Comaniciu, D. and Meer, P. (2002). Mean shift: A robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24:603–619.

Friedman, J., Hastie, T., and Tibshirani, R.J. (2000). Additive logistic regression: A statistical view of boosting (with discussion). The Annals of Statistics, 28:337–407.

Fukunaga, K. and Hostetler, L.D. (1975). The estimation of the gradient of a density function, with applications in pattern recognition. IEEE Transactions on Information Theory, 21:32–40.

Genovese, C.R., Perone-Pacifico, M., Verdinelli, I., and Wasserman, L.A. (2012). Manifold estimation and singular deconvolution under Hausdorff loss. The Annals of Statistics, 40:941–963.

Genovese, C.R., Perone-Pacifico, M., Verdinelli, I., and Wasserman, L.A. (2012). Minimax manifold estimation. Journal of Machine Learning Research, 13:1263–1291.

Genovese, C.R., Perone-Pacifico, M., Verdinelli, I., and Wasserman, L.A. (2012). Nonparametric ridge estimation. arXiv preprint arXiv:1212.5156.

Irwin, M.C. (1980). Smooth Dynamical Systems. Academic Press, New York.

Ozertem, U. and Erdogmus, D. (2011). Locally defined principal curves and surfaces. Journal of Machine Learning Research, 12:1249–1286.


45 A trio of inference problems that could win you a Nobel Prize in statistics (if you help fund it)

Xiao-Li Meng
Department of Statistics, Harvard University, Cambridge, MA

Statistical inference is a field full of problems whose solutions require the same intellectual force needed to win a Nobel Prize in other scientific fields. Multi-resolution inference is the oldest of the trio. But emerging applications such as individualized medicine have challenged us to the limit: infer estimands with resolution levels that far exceed those of any feasible estimator. Multi-phase inference is another reality because (big) data are almost never collected, processed, and analyzed in a single phase. The newest of the trio is multi-source inference, which aims to extract information from data coming from very different sources, some of which were never intended for inference purposes. All of these challenges call for an expanded paradigm, with greater emphases on qualitative consistency and relative optimality than our current inference paradigms provide.

45.1 Nobel Prize? Why not COPSS?

The title of my chapter is designed to grab attention. But why Nobel Prize (NP)? Wouldn't it be more fitting, for a volume celebrating the 50th anniversary of COPSS, to entitle it "A Trio of Inference Problems That Could Win You a COPSS Award (and you don't even have to fund it)"? Indeed, some media and individuals have even claimed that the COPSS Presidents' Award is the NP in Statistics, just as they consider the Fields Medal to be the NP in Mathematics.

No matter how our egos might wish such a claim to be true, let us face the reality. There is no NP in statistics, and worse, the general public does not seem to appreciate statistics as a "rocket science" field.


Or as a recent blog post (August 14, 2013) in Simply Statistics put it: "Statistics/statisticians need better marketing" because (among other reasons)

"Our top awards don't get the press they do in other fields. The Nobel Prize announcements are an international event. There is always speculation/intense interest in who will win. There is similar interest around the Fields Medal in mathematics. But the top award in statistics, the COPSS award, doesn't get nearly the attention it should. Part of the reason is lack of funding (the Fields is $15K, the COPSS is $1K). But part of the reason is that we, as statisticians, don't announce it, share it, speculate about it, tell our friends about it, etc. The prestige of these awards can have a big impact on the visibility of a field."

The fact that there is more public interest in the Fields than in COPSS should make most statisticians pause. No one in their right mind would downplay the centrality of mathematics in scientific and societal advancement throughout human history. Statistics seems to be starting to enjoy a similar reputation as being at the core of such endeavors as we move deeper into the digital age. However, the attention around top mathematical awards such as the Fields Medal has hardly been about their direct or even indirect impact on everyday life, in sharp contrast to our emphasis on the practicality of our profession. Rather, these awards arouse media and public interest by featuring how ingenious the awardees are and how difficult the problems they solved, much like how conquering Everest bestows admiration not because the admirers care or even know much about Everest itself but because it represents the ultimate physical feat. In this sense, the biggest winner of the Fields Medal is mathematics itself: enticing the brightest talent to seek the ultimate intellectual challenges.

And that is the point I want to reflect upon. Have we statisticians adequately conveyed to the media and general public the depth and complexity of our beloved subject, in addition to its utility? Have we tried to demonstrate that the field of statistics has problems (e.g., modeling ignorance) that are as intellectually challenging as the Goldbach conjecture or the Riemann Hypothesis, and arguably even more so because our problems cannot be formulated by mathematics alone? In our effort to make statistics as simple as possible for general users, have we also emphasized adequately that reading a couple of stat books or taking a couple of stat courses does not qualify one to teach statistics?

In recent years I have written about making statistics as easy to learn as possible. But my emphasis (Meng, 2009b) has been that we must make a tremendous collective effort to change the perception that "Statistics is easy to teach, but hard (and boring) to learn" to a reality of "Statistics is hard to teach, but easy (and fun) to learn." Statistics is hard to teach because it is intellectually a very demanding subject, and to teach it well requires both depth in theory and breadth in application. It is easy and fun to learn because it is directly rooted in everyday life (when it is conveyed as such) and it builds upon many common logics, not because it lacks challenging problems or deep theory.


Therefore, the invocation of NP in the title is meant to remind ourselves that we can also attract the best minds to statistics by demonstrating how intellectually demanding it is. As a local example, my colleague Joe Blitzstein turned our Stat110 from an enrollment of about 80 to over 480 by making it both more real-life rooted and more intellectually demanding. The course has become a Harvard sensation, to the point that when our students' newspaper advises freshmen on "how to make 20% effort and receive 80% grade," it explicitly states that Stat110 is an exception and should be taken regardless of the effort required. And of course the NPs in the natural and social sciences are aimed at work with enormous depth, profound impact, and ideally both. The trio of inference problems described below share these features — their solutions require developing some of the deepest theory in inference, and their impacts are immeasurable because of their ubiquity in quantitative scientific inquiries.

The target readership of this chapter can best be described by a Chinese proverb: "Newborn calves are unafraid of tigers," meaning those young talents who are particularly curious and courageous in their intellectual pursuits. I surely hope that future COPSS (if not NP) winners are among them.

45.2 Multi-resolution inference

To borrow an engineering term, a central task of statistical inference is to separate signal from noise in the data. But what is signal and what is noise? Traditionally, we teach this separation by writing down a regression model, typically linear,
\[
Y = \sum_{i=0}^{p} \beta_i X_i + \epsilon,
\]
with the regression function $\sum_{i=0}^{p} \beta_i X_i$ as signal, and $\epsilon$ as noise. Soon we teach that the real meaning of $\epsilon$ is anything that is not captured by our designated "signal," and hence the "noise" $\epsilon$ could still contain, in real terms, signals of interest or that should be of interest.

This seemingly obvious point reminds us that the concepts of signal and noise are relative — noise for one study can be signal for another, and vice versa. This relativity is particularly clear for those who are familiar with multi-resolution methods in engineering and applied mathematics, such as wavelets (see Daubechies, 1992; Meyer, 1993), where we use wavelet coefficients below or at a primary resolution for estimating signals. The higher frequency ones are treated as noise and used for variance estimation; see Donoho and Johnstone (1994), Donoho et al. (1995) and Nason (2002). Therefore, what counts as signal or noise depends entirely on our choice of the primary resolution.


The multi-resolution framework described below is indeed inspired by my learning of wavelets and related multi-resolution methods (Bouman et al., 2005, 2007; Lee and Meng, 2005; Hirakawa and Meng, 2006), and motivated by the need to deal with Big Data, where the complexity of emerging questions has forced us to go diving for perceived signals in what would have been discarded as noise merely a decade ago.

But how much of the signal that our inference machine recovers will be robust to the assumptions we make (e.g., via likelihood, prior, estimating equations, etc.), and how much will wash out as noise with the ebb and flow of our assumptions? Such a question arose when I was asked to help analyze a large national survey on health, where the investigator was interested in studying men over 55 years old who had immigrated to the US from a particular country, among other such "subpopulation analyses." You may wonder what is so special about wanting such an analysis. Well, nothing really, except that there was not a single man in the dataset who fit the description! I was therefore brought in to deal with the problem because the investigator had learned that I could perform the magic of multiple imputation. (Imagine how much data collection resource could have been saved if I could multiply impute myself!) Surely I could (and did) build some hierarchical model to "borrow information," as is typical for small area estimations; see Gelman et al. (2003) and Rao (2005). In the dataset, there were men over 55, men who immigrated from that country, and even men over 55 who immigrated from a neighboring country. That is, although we had no direct data from the subpopulation of interest, we had plenty of indirect data from related populations, however defined. But how confident should I be that whatever my hierarchical machine produces is reproducible by someone who actually has direct data from the target subpopulation?

Of course you may ask why the investigator wanted to study a subpopulation with no direct data whatsoever. The answer turned out to be rather simple and logical. Just like we statisticians want to work on topics that are new and/or challenging, (social) scientists want to do the same. They are much less interested in repeating well-established results for large populations than in making headway on subpopulations that are difficult to study. And what could be more difficult than studying a subpopulation with no data? Indeed, political scientists and others routinely face the problem of empty cells in contingency tables; see Gelman and Little (1997) and Lax and Phillips (2009).

If you think this sounds rhetorical or even cynical, consider the rapidly increasing interest in individualized medicine. If I am sick and given a choice of treatments, the central question to me is which treatment has the best chance to cure me, not some randomly selected 'representative' person. There is no logical difference between this desire and the aforementioned investigator's desire to study a subpopulation with no observations. The clinical trials testing these treatments surely did not include a subject replicating my description exactly, but this does not stop me from desiring individualized treatments.


The grand challenge therefore is how to infer an estimand with granularity or resolution that (far) exceeds what can be estimated directly from the data, i.e., we run out of enough sample replications (way) before reaching the desired resolution level.

45.2.1 Resolution via filtration and decomposition

To quantify the role of resolution for inference, consider an outcome variable $Y$ living on the same probability space as an information filtration $\{\mathcal{F}_r, r = 0,\ldots,R\}$. For example, $\mathcal{F}_r = \sigma(X_0,\ldots,X_r)$, the $\sigma$-field generated by covariates $\{X_0,\ldots,X_r\}$, which perhaps is the most common practical situation. The discussion below is general, as long as $\mathcal{F}_{r-1} \subset \mathcal{F}_r$, where $r \in \{1,\ldots,R\}$ can be viewed as an index of resolution. Intuitively, we can view $\mathcal{F}_r$ as a set of specifications that restrict our target population — the increased specification/information as captured by $\mathcal{F}_r$ allows us to zoom into more specific subpopulations; here we assume $\mathcal{F}_0$ is the trivial zero-information filter, i.e., $X_0$ represents the constant intercept term, and $\mathcal{F}_R$ is the maximal filter, e.g., with infinite resolution to identify a unique individual, and $R$ can be infinite. Let
\[
\mu_r = E(Y \mid \mathcal{F}_r) \quad \text{and} \quad \sigma_r^2 = \mathrm{var}(Y \mid \mathcal{F}_r)
\]
be the conditional mean (i.e., regression) and conditional variance (or covariance) of $Y$ given $\mathcal{F}_r$, respectively. When $\mathcal{F}_r$ is generated by $\{X_0,\ldots,X_r\}$, we have the familiar $\mu_r = E(Y \mid X_0,\ldots,X_r)$ and $\sigma_r^2 = \mathrm{var}(Y \mid X_0,\ldots,X_r)$.

Applying the familiar EVE law
\[
\mathrm{var}(Y \mid \mathcal{F}_r) = E\{\mathrm{var}(Y \mid \mathcal{F}_s) \mid \mathcal{F}_r\} + \mathrm{var}\{E(Y \mid \mathcal{F}_s) \mid \mathcal{F}_r\},
\]
where $s > r$, we obtain the conditional ANOVA decomposition
\[
\sigma_r^2 = E(\sigma_s^2 \mid \mathcal{F}_r) + E\{(\mu_s - \mu_r)^2 \mid \mathcal{F}_r\}. \tag{45.1}
\]
This key identity reveals that the (conditional) variance at resolution $r$ is the sum of an estimated variance and an estimated (squared) bias. In particular, we use the information in $\mathcal{F}_r$ (and our model assumptions) to estimate the variance at the higher resolution $s$ and to estimate the squared bias incurred from using $\mu_r$ to proxy for $\mu_s$. This perspective stresses that $\sigma_r^2$ is itself also an estimator, in fact our best guess at the reproducibility of our indirect data inference at resolution $r$ by someone with direct data at resolution $s$.

This dual role of being simultaneously an estimand (of a lower resolution estimator) and an estimator (of a higher resolution estimand) is the essence of the multi-resolution formulation, unifying the concepts of variance and bias, and of model estimation and model selection. Specifically, when we set up a model with the signal part at a particular resolution $r$ (e.g., $r = p$ for the linear model), we consider $\mu_r$ to be an acceptable estimate for any $\mu_s$ with $s > r$. That is, even though the difference between $\mu_s$ and $\mu_r$ reflects systematic variation, we purposely re-classify it as a component of random variation.
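As a quick numerical sanity check on (45.1), here is a sketch of my own (an editorial illustration, not part of the chapter): take a toy linear model $Y = \beta_1 X_1 + \beta_2 X_2 + \epsilon$ with independent standard normal $X_2$ and $\epsilon$, and condition on a fixed value of $X_1$. Conditioning on $X_1$ plays the role of $\mathcal{F}_r$ and conditioning on $(X_1, X_2)$ the role of $\mathcal{F}_s$; both sides of (45.1) should come out to $\beta_2^2 + 1$.

```python
import numpy as np

rng = np.random.default_rng(1)
beta1, beta2 = 2.0, 1.5
x1 = 0.7                          # a fixed realization of X_1, i.e., we condition on F_r
n = 1_000_000

x2 = rng.standard_normal(n)       # the extra covariate that refines F_r into F_s
eps = rng.standard_normal(n)      # residual noise
y = beta1 * x1 + beta2 * x2 + eps

mu_r = beta1 * x1                 # E(Y | F_r): a single number once X_1 is fixed
mu_s = beta1 * x1 + beta2 * x2    # E(Y | F_s): varies with X_2
sigma2_s = 1.0                    # var(Y | F_s) = var(eps), known in this toy model

lhs = y.var()                                   # sigma_r^2, estimated by simulation
rhs = sigma2_s + np.mean((mu_s - mu_r) ** 2)    # E(sigma_s^2 | F_r) + E{(mu_s - mu_r)^2 | F_r}
print(lhs, rhs, beta2 ** 2 + 1)                 # all three are close to 3.25
```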


In the strictest sense, bias results whenever real information remains in the residual variation (e.g., the $\epsilon$ term in the linear model). However, statisticians have chosen to further categorize bias in this strict sense depending on whether it occurs above or below/at the resolution level $r$. When the information in the residual variation resides in resolutions higher than $r$, then we use the term "variance" for the price of failing to include that information. When the residual information resides in resolutions lower than or at $r$, then we keep the designation "bias." This categorization, just as the mathematician's $O$ notation, serves many useful purposes, but we should not forget that it is ultimately artificial.

This point is most clear when we apply (45.1) in a telescopic fashion (by first making $s = r + 1$ and then summing over $r$) and when $R = \infty$:
\[
\sigma_r^2 = E(\sigma_\infty^2 \mid \mathcal{F}_r) + \sum_{i=r}^{\infty} E\{(\mu_{i+1} - \mu_i)^2 \mid \mathcal{F}_r\}. \tag{45.2}
\]
The use of $R = \infty$ is a mathematical idealization of the situations where our specifications can go on indefinitely, such as with individualized medicine, where we have height, weight, age, gender, race, education, habit, all sorts of medical test results, family history, genetic compositions, environmental factors, etc. That is, we switch from the hopeless $n = 1$ (i.e., a single individual) case to the hopeful $R = \infty$ scenario. The $\sigma_\infty^2$ term captures the variation of the population at infinite resolution. Whether $\sigma_\infty^2$ should be set to zero or not reflects whether we believe the world is fundamentally stochastic or appears to be stochastic because of our human limitation in learning every mechanism responsible for variations, as captured by $\mathcal{F}_\infty$. In that sense $\sigma_\infty^2$ can be viewed as the intrinsic variance with respect to a given filtration. Everything else in the variance at resolution $r$ is merely bias (e.g., from using $\mu_i$ to estimate $\mu_{i+1}$) accumulated at higher resolutions.

45.2.2 Resolution model estimation and selection

When $\sigma_\infty^2 = 0$, the infinite-resolution setup essentially is the same as a potential outcome model (Rubin, 2005), because the resulting population is of size one and hence comparisons on treatment effects must be counterfactual. This is exactly the right causal question for individualized treatments: what would be my (health, test) outcome if I receive one treatment versus another? In order to estimate such an effect, however, we must lower the resolution to a finite and often small degree, making it possible to estimate average treatment effects, by averaging over a population that permits some degrees of replication. We then hope that the attributes (i.e., predictors) left in the "noise" will not contain enough real signals to alter our quantitative results, as compared to if we had enough data to model those attributes as signals, to a degree that would change our qualitative conclusions, such as choosing one treatment versus another.


That is, when we do not have enough (direct) data to estimate $\mu_R$, we first choose a $\mathcal{F}_{\tilde{r}}$, and then estimate $\mu_R$ by $\hat{\mu}_{\tilde{r}}$. The "double decoration" notation $\hat{\mu}_{\tilde{r}}$ highlights two kinds of error:
\[
\hat{\mu}_{\tilde{r}} - \mu_R = (\hat{\mu}_{\tilde{r}} - \mu_{\tilde{r}}) + (\mu_{\tilde{r}} - \mu_R). \tag{45.3}
\]
The first parenthesized term in (45.3) represents the usual model estimation error (for the given $\tilde{r}$), and hence the usual "hat" notation. The second is the bias induced by the resolution discrepancy between our actual estimand and intended estimand, which represents the often forgotten model selection error. As such, we use the more ambiguous "tilde" notation $\tilde{r}$, because its construction cannot be based on data alone, and it is not an estimator of $R$ (e.g., we hope $\tilde{r} \ll R$).

Determining $\tilde{r}$, as a model selection problem, then inherits the usual bias-variance trade-off issue. Therefore, any attempt to find an "automated" way to determine $\tilde{r}$ would be as disappointing as those aimed at automated procedures for optimal bias-variance trade-off (see Meng, 2009a; Blitzstein and Meng, 2010). Consequently, we must make assumptions in order to proceed. Here the hope is that the resolution formulation can provide alternative or even better ways to pose assumptions suitable for quantifying the trade-off in practice and for combating other thorny issues, such as nuisance parameters. In particular, if we consider the filtration $\{\mathcal{F}_r, r = 0, 1, \ldots\}$ as a cumulative "information basis," then the choice of $\tilde{r}$ essentially is in the same spirit as finding a sparse representation in wavelets, for which there is a large literature; see, e.g., Donoho and Elad (2003), Poggio and Girosi (1998), and Yang et al. (2009). Here, though, it is more appropriate to label $\mu_{\tilde{r}}$ as a parsimonious representation of $\mu_R$.

As usual, we can impose assumptions via prior specifications (or penalty for penalized likelihood). For example, we can impose a prior on the model complexity $\tilde{R}_\delta$, the smallest (fixed) $r$ such that $E\{(\mu_r - \mu_R)^2\} \le \delta$, where $\delta$ represents the acceptable trade-off between granularity and model complexity (e.g., involving more $X$'s) and the associated data and computational cost. Clearly $\tilde{R}_\delta$ always exists, but it may be the case that $\tilde{R}_\delta = R$, which means that no lower-resolution approximation is acceptable for the given $\delta$.

Directly posing a prior for $\tilde{R}_\delta$ is similar to using $L_0$-regularization (Lin et al., 2010). Its usefulness depends on whether we can expect all $X_r$'s to be more or less exchangeable in terms of their predictive power. Otherwise, the resolution framework reminds us to consider putting a prior on the ordering of the $X_i$'s (in terms of predictive power). Conditional on the ordering, we impose priors on the predictive power of incremental complexity, $\Delta_r = \mu_{r+1} - \mu_r$. These priors should reflect our expectation for $\Delta_r^2$ to decay with $r$, such as imposing $E(\Delta_r^2) > E(\Delta_{r+1}^2)$. If monotonicity seems too strong an assumption, we could first break the $X_i$'s into groups, assume exchangeability within each group, and then order the groups according to predictive power. That is to say, finding a complete ordering of the $X_i$'s may require prior knowledge that is too refined. We weaken this knowledge requirement by seeking only an ordering over equivalence classes of the $X_i$'s, where each equivalence class represents a set of variables which we are not able to a priori distinguish with respect to predictive power.
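To see the kind of decay these priors are meant to encode, here is a small simulation of my own (an editorial sketch, not from the chapter), assuming independent standard normal covariates in a linear model. Under that assumption, $\Delta_r = \mu_{r+1} - \mu_r = \beta_{r+1} X_{r+1}$, so $E(\Delta_r^2) = \beta_{r+1}^2$, and ordering the covariates by the magnitude of their coefficients produces exactly the rapid decay one hopes to achieve.

```python
import numpy as np

rng = np.random.default_rng(2)
beta = np.array([3.0, 2.0, 1.2, 0.7, 0.4, 0.2, 0.1, 0.05])  # assumed ordered by predictive power
n, R = 200_000, len(beta)
X = rng.standard_normal((n, R))

# For this model mu_r = sum_{j <= r} beta_j X_j, so Delta_r = mu_{r+1} - mu_r = beta_{r+1} X_{r+1}.
mu = np.cumsum(beta * X, axis=1)                      # column r holds mu_{r+1}
delta_sq = np.mean(np.diff(mu, axis=1) ** 2, axis=0)  # Monte Carlo estimate of E(Delta_r^2)
print(np.round(delta_sq, 3))                          # decays rapidly ...
print(np.round(beta[1:] ** 2, 3))                     # ... tracking beta_{r+1}^2
```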


The telescoping additivity in (45.2) implies that imposing a prior on the magnitude of $\Delta_r$ will induce a control over the "total resolution bias" (TRB)
\[
E(\mu_{\tilde{R}_\delta} - \mu_R)^2 = \sum_{r=\tilde{R}_\delta}^{R-1} E(\mu_r - \mu_{r+1})^2,
\]
which holds because $\Delta_r$ and $\Delta_s$ are orthogonal (i.e., uncorrelated) when $s \neq r$.

A good illustration of this rationale is provided when $\mathcal{F}_r$ is generated by a series of binary variables $\{X_0,\ldots,X_r\}$ with $r \in \{0,\ldots,R\}$. In such cases, our multi-resolution setup is equivalent to assuming a weighted binary tree model with total depth $R$; see Knuth (1997) and Garey (1974). Here each node is represented by a realization of $\vec{X}_r = (X_0,\ldots,X_r)$, $\vec{x}_r = (x_0,\ldots,x_r)$, at which the weights of its two (forward) branches are given by $w_{\vec{x}_r}(x) = E(Y \mid \vec{X}_r = \vec{x}_r, X_{r+1} = x)$, respectively, with $x = 0, 1$. It is then easy to show that
\[
E(\Delta_r^2) \le \tfrac{1}{4} E\{w_{\vec{X}_r}(1) - w_{\vec{X}_r}(0)\}^2 \equiv \tfrac{1}{4} E\{D^2(\vec{X}_r)\},
\]
where $D^2(\vec{X}_r)$ is a measure of the predictive power of $X_{r+1}$ that is not already contained in $\vec{X}_r$. For the previous linear regression, $D^2(\vec{X}_r) = \beta_{r+1}^2$. Thus putting a prior on $D^2(\vec{X}_r)$ can be viewed as a generalization of putting a prior on the regression coefficient, as routinely done in Bayesian variable selection; see Mitchell and Beauchamp (1988) and George and McCulloch (1997).

It is worthwhile to emphasize that Bayesian methods, or at least the idea of introducing assumptions on the $\Delta_r$'s, seem inevitable. This is because "pure" data-driven methods, such as cross-validation (Arlot and Celisse, 2010), are unlikely to be fruitful here — the basic motivation of a multi-resolution framework is the lack of sufficient replications at high resolutions (unless we impose non-testable exchangeability assumptions to justify synthetic replications, but then we are just being Bayesian). It is equally important to point out that the currently dominant practice of pretending $\mu_{\tilde{R}} = \mu_R$ makes the strongest Bayesian assumption of all: the TRB, and hence any $\Delta_r$ ($r \ge \tilde{R}$), is exactly zero. In this sense, using a non-trivial prior for $\Delta_r$ makes less extreme assumptions than currently done in practice.

In a nutshell, a central aim of putting a prior on $\Delta_r$ to regulate the predictive power of the covariates is to identify practical ways of ordering a set of covariates to form the filtration $\{\mathcal{F}_r, r \ge 0\}$ so as to achieve rapid decay of $E(\Delta_r^2)$ as $r$ increases, essentially the same goal as for stepwise regression or principal component analysis. By exploring the multi-resolution formulation we hope to identify viable alternatives to common approaches such as the LASSO. In general, for the multi-resolution framework to be fruitful beyond the conceptual level, many fundamental and methodological questions must be answered. The three questions below are merely antipasti to whet your appetite (for NP, or not):


(a) For what classes of models on $\{Y, X_j, j = 0,\ldots,R\}$ and priors on ordering and predictive power can we determine practically an order $\{X_{(j)}, j \ge 0\}$ such that the resulting $\mathcal{F}_r = \sigma(X_{(j)}, j = 0,\ldots,r)$ will ensure a parsimonious representation of $\mu_R$ with quantifiably high probability?

(b) What should be our guiding principles for making a trade-off between sample size $n$ and recorded/measured data resolution $R$, when we have the choice between having more data of lower quality (large $n$, small $R$) or less data of higher quality (small $n$, large $R$)?

(c) How do we determine the appropriate resolution level for hypothesis testing, considering that hypothesis testing involving higher resolution estimands typically leads to larger multiplicity? How much multiplicity can we reasonably expect our data to accommodate, and how do we quantify it?

45.3 Multi-phase inference

Most of us learned about statistical modelling in the following way. We have a data set that can be described by a random variable $Y$, which can be modelled by a probability function or density $\Pr(Y \mid \theta)$. Here $\theta$ is a model parameter, which can be of infinite dimension when we adopt a non-parametric or semi-parametric philosophy. Many of us were also taught to resist the temptation of using a model just because it is convenient, mentally, mathematically, or computationally. Instead, we were taught to learn as much as possible about the data generating process, and to think critically about what makes sense substantively, scientifically, and statistically. We were then told to check and re-check the goodness-of-fit, or rather the lack of fit, of the model to our data, and to revise our model whenever our resources (time, energy, and funding) permit.

These pieces of advice are all very sound. Indeed, a hallmark of statistics as a scientific discipline is its emphasis on critical and principled thinking about the entire process, from data collection to analysis to interpretation to communication of results. However, when we take our proud way of thinking (or our reputation) most seriously, we will find that we have not practiced what we have preached in a rather fundamental way.

I wish this were merely an attention-grabbing statement like the title of my chapter. But the reality is that when we put down a single model $\Pr(Y \mid \theta)$, however sophisticated or "assumption-free," we have already simplified too much. The reason is simple. In real life, especially in this age of Big Data, the data arriving at an analyst's desk or disk are almost never the original raw data, however defined. These data have been pre-processed, often in multiple phases, because someone felt that they were too dirty to be useful, or too large to pass on, or too confidential to let the user see everything, or all of the above!


Examples range from microarrays to astrophysics; see Blocker and Meng (2013).

"So what?" Some may argue that all this can be captured by our model $\Pr(Y \mid \theta)$, at least in theory, if we have made enough effort to learn about the entire process. Putting aside the impossibility of learning about everything in practice (Blocker and Meng, 2013), we will see that the single-model formulation is simply not rich enough to capture reality, even if we assume that every pre-processor and analyst has done everything correctly. The trouble here is that pre-processors and analysts have different goals, have access to different data resources, and make different assumptions. They typically do not and cannot communicate with each other, resulting in separate (model) assumptions that no single probabilistic model can coherently encapsulate. We need a multiplicity of models to capture a multiplicity of incompatible assumptions.

45.3.1 Multiple imputation and uncongeniality

I learned about these complications during my study of the multiple imputation (MI) method (Rubin, 1987), where the pre-processor is the imputer. The imputer's goal was to preserve as much as possible in the imputed data the joint distributional properties of the original complete data (assuming, of course, the original complete-data samples were scientifically designed so that their properties are worthy of preservation). For that purpose, the imputer should and will use anything that can help, including confidential information, as well as powerful predictive models that may not capture the correct causal relations.

In addition, because the imputed data typically will be used for many purposes, most of which cannot be anticipated at the time of imputation, the imputation model needs to include as many predictors as possible, and be as saturated as the data and resources permit; see Meng (1994) and Rubin (1996). In contrast, an analysis model, or rather an approach (e.g., given by software), often focuses on specific questions and may involve only a (small) subset of the variables used by the imputer. Consequently, the imputer's model and the user's procedure may be uncongenial to each other, meaning that no model can be compatible with both the imputer's model and the user's procedure. The technical definitions of congeniality are given in Meng (1994) and Xie and Meng (2013), which involve embedding an analyst's procedure (often of frequentist nature) into an imputation model (typically with Bayesian flavor). For the purposes of the following discussion, two models are "congenial" if their implied imputation and analysis procedures are the same. That is, they are operationally, though perhaps not theoretically, equivalent.

Ironically, the original motivation of MI (Rubin, 1987) was a separation of labor, asking those who have more knowledge and resources (e.g., the US Census Bureau) to fix/impute the missing observations, with the hope that subsequent analysts can then apply their favorite complete-data analysis procedures to reach valid inferences.


This same separation creates the issue of uncongeniality. The consequences of uncongeniality can be severe, from both theoretical and practical points of view. Perhaps the most striking example is that the very appealing variance combining rule for MI inference derived under congeniality (and another application of the aforementioned EVE law), namely,
\[
\mathrm{var}_{\text{Total}} = \mathrm{var}_{\text{Between-imputation}} + \mathrm{var}_{\text{Within-imputation}}, \tag{45.4}
\]
can lead to seriously invalid results in the presence of uncongeniality, as reported initially by Fay (1992) and Kott (1995).

Specifically, the so-called Rubin's variance combining rule is based on (45.4), where $\mathrm{var}_{\text{Between-imputation}}$ and $\mathrm{var}_{\text{Within-imputation}}$ are estimated by $(1 + m^{-1})B_m$ and $\bar{U}_m$, respectively (Rubin, 1987). Here the $(1 + m^{-1})$ factor accounts for the Monte Carlo error due to finite $m$, $B_m$ is the sampling variance of $\hat{\theta}^{(l)} \equiv \hat{\theta}_A(Y^{(l)}_{\text{com}})$, and $\bar{U}_m$ is the sample average of $U(Y^{(l)}_{\text{com}})$, $l = 1,\ldots,m$, where $\hat{\theta}_A(Y_{\text{com}})$ is the analyst's complete-data estimator for $\theta$, $U(Y_{\text{com}})$ is its associated variance (estimator), and the $Y^{(l)}_{\text{mis}}$ are i.i.d. draws from an imputation model $P_I(Y_{\text{mis}} \mid Y_{\text{obs}})$. Here, for notational convenience, we assume the complete data $Y_{\text{com}}$ can be decomposed into the missing data $Y_{\text{mis}}$ and observed data $Y_{\text{obs}}$. The left-hand side of (45.4) then is meant to be an estimator, denoted by $T_m$, of the variance of the MI estimator of $\theta$, i.e., $\bar{\theta}_m$, the average of $\{\hat{\theta}^{(l)}, l = 1,\ldots,m\}$.

To understand the behavior of $\bar{\theta}_m$ and $T_m$, let us consider a relatively simple case where the missing data are missing at random (Rubin, 1976), and the imputer does not have any additional data. Yet the imputer has adopted a Bayesian model uncongenial to the analyst's complete-data likelihood function, $P_A(Y_{\text{com}} \mid \theta)$, even though both contain the true data-generating model as a special case. For example, the analyst may have correctly assumed that two subpopulations share the same mean, an assumption that is not in the imputation model; see Meng (1994) and Xie and Meng (2013). Furthermore, we assume the analyst's complete-data procedure is the fully efficient MLE $\hat{\theta}_A(Y_{\text{com}})$, and $U_A(Y_{\text{com}})$, say, is the usual inverse of the Fisher information.

Clearly we need to take into account both the sampling variability and the imputation uncertainty, and for consistency we need to take both the imputation size $m \to \infty$ and the data size $n \to \infty$. That is, we need to consider replications generated by the hybrid model (note that $P_I(Y_{\text{mis}} \mid Y_{\text{obs}})$ is free of $\theta$):
\[
P_H(Y_{\text{mis}}, Y_{\text{obs}} \mid \theta) = P_I(Y_{\text{mis}} \mid Y_{\text{obs}})\, P_A(Y_{\text{obs}} \mid \theta), \tag{45.5}
\]
where $P_A(Y_{\text{obs}} \mid \theta)$ is derived from the analyst's complete-data model $P_A(Y_{\text{com}} \mid \theta)$.
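For readers who have not used it, the combining rule just described is a one-liner to implement. The sketch below (my own illustration with made-up numbers, not from the chapter) returns the MI point estimate $\bar{\theta}_m$ and Rubin's total variance $T_m = \bar{U}_m + (1 + m^{-1})B_m$ for a scalar $\theta$.

```python
import numpy as np

def rubin_combine(theta_hat, u_hat):
    """Combine m complete-data analyses of multiply imputed data via (45.4).

    theta_hat : the m complete-data point estimates theta_hat^(l)
    u_hat     : the m complete-data variance estimates U(Y_com^(l))
    Returns (theta_bar_m, T_m) for a scalar parameter.
    """
    theta_hat, u_hat = np.asarray(theta_hat, float), np.asarray(u_hat, float)
    m = len(theta_hat)
    b_m = theta_hat.var(ddof=1)          # between-imputation variance B_m
    u_bar = u_hat.mean()                 # within-imputation variance U_bar_m
    t_m = u_bar + (1 + 1 / m) * b_m      # Rubin's total variance T_m
    return theta_hat.mean(), t_m

# Toy usage with m = 5 made-up imputations:
print(rubin_combine([1.02, 0.95, 1.10, 0.99, 1.05], [0.040, 0.050, 0.045, 0.048, 0.042]))
```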


To illustrate the complication caused by uncongeniality, let us assume $m = \infty$ to eliminate the distraction of Monte Carlo error due to finite $m$. Writing
\[
\bar{\theta}_\infty - \theta = \{\bar{\theta}_\infty - \hat{\theta}_A(Y_{\text{com}})\} + \{\hat{\theta}_A(Y_{\text{com}}) - \theta\},
\]
we have
\[
\mathrm{var}_H(\bar{\theta}_\infty) = \mathrm{var}_H\{\bar{\theta}_\infty - \hat{\theta}_A(Y_{\text{com}})\} + \mathrm{var}_H\{\hat{\theta}_A(Y_{\text{com}})\} + 2\,\mathrm{cov}_H\{\bar{\theta}_\infty - \hat{\theta}_A(Y_{\text{com}}),\, \hat{\theta}_A(Y_{\text{com}})\}, \tag{45.6}
\]
where all the expectations are with respect to the hybrid model defined in (45.5). Since we assume both the imputer's model and the analyst's model are valid, it is not too hard to see intuitively — and to prove under regularity conditions, as in Xie and Meng (2013) — that the first and second terms on the right-hand side of (45.6) are still estimated consistently by $B_m$ and $\bar{U}_m$, respectively. However, the trouble is that the cross term as given in (45.6) is left out by (45.4), so unless this term is asymptotically negligible, Rubin's variance estimator of $\mathrm{var}_H(\bar{\theta}_\infty)$ via (45.4) cannot be consistent, an observation first made by Kott (1995).

Under congeniality, this term is indeed negligible. This is because, under our current setting, $\bar{\theta}_\infty$ is asymptotically (as $n \to \infty$) the same as the analyst's MLE based on the observed data $Y_{\text{obs}}$; we denote it, with an abuse of notation, by $\hat{\theta}_A(Y_{\text{obs}})$. But $\hat{\theta}_A(Y_{\text{obs}}) - \hat{\theta}_A(Y_{\text{com}})$ and $\hat{\theta}_A(Y_{\text{com}})$ must be asymptotically orthogonal (i.e., uncorrelated) under $P_A$, which in turn is asymptotically the same as $P_H$ due to congeniality (under the usual regularity conditions that guarantee the equivalence of frequentist and Bayesian asymptotics). Otherwise there must exist a linear combination of $\hat{\theta}_A(Y_{\text{obs}}) - \hat{\theta}_A(Y_{\text{com}})$ and $\hat{\theta}_A(Y_{\text{com}})$ — and hence of $\hat{\theta}_A(Y_{\text{obs}})$ and $\hat{\theta}_A(Y_{\text{com}})$ — that is asymptotically more efficient than $\hat{\theta}_A(Y_{\text{com}})$, contradicting the fact that $\hat{\theta}_A(Y_{\text{com}})$ is the full MLE under $P_A(Y_{\text{com}} \mid \theta)$.

When uncongeniality arises, it becomes entirely possible that there exists a linear combination of $\bar{\theta}_\infty - \hat{\theta}_A(Y_{\text{com}})$ and $\hat{\theta}_A(Y_{\text{com}})$ that is more efficient than $\hat{\theta}_A(Y_{\text{com}})$, at least under the actual data generating model. This is because $\bar{\theta}_\infty$ may inherit, through the imputed data, additional (valid) information that is not available to the analyst, and hence is not captured by $P_A(Y_{\text{com}} \mid \theta)$. Consequently, the cross term in (45.6) is not asymptotically negligible, making (45.4) an inconsistent variance estimator; see Fay (1992), Meng (1994), and Kott (1995).

The above discussion also hints at an issue that makes the multi-phase inference formulation both fruitful and intricate, because it indicates that consistency can be preserved when the imputer's model does not bring in additional (correct) information. This is a much weaker requirement than congeniality, because it is satisfied, for example, when the analyst's model is nested within (i.e., less saturated than) the imputer's model. Indeed, in Xie and Meng (2013) we established precisely this fact, under regularity conditions. However, when we assume that the imputer's model is nested within the analyst's model, we can prove only that (45.4) has a positive bias.


But even this weaker result requires an additional assumption — for multivariate $\theta$ — that the loss of information is the same for all components of $\theta$. This additional requirement for multivariate $\theta$ was both unexpected and troublesome, because in practice there is little reason to expect that the loss of information will be the same for different parameters.

All these complications vividly demonstrate both the need for and the challenges of the multi-phase inference framework. By multi-phase, our motivation is not merely that there are multiple parties involved, but more critically that the phases are sequential in nature. Each phase takes the output of its immediate previous phase as the input, but with little knowledge of how other phases operate. This lack of mutual knowledge is a reality that leads to uncongeniality, which makes any single-model framework inadequate for reasons stated before.

45.3.2 Data pre-processing, curation and provenance

Taking this multi-phase perspective but going beyond the MI setting, we (Blocker and Meng, 2013) recently explored the steps needed for building a theoretical foundation for pre-processing in general, with motivating applications from microarrays and astrophysics. We started with a simple but realistic two-phase setup, where for the pre-processor phase, the input is $Y$ and the output is $T(Y)$, which becomes the input of the analysis phase. The pre-processing is done under an "observation model" $P_Y(Y \mid X, \xi)$, where $X$ represents the ideal data we do not have (e.g., the true expression level for each gene), because we observe only a noisy version of it, $Y$ (e.g., observed probe-level intensities), and where $\xi$ is the model parameter characterizing how $Y$ is related to $X$, including how noise was introduced into the observation process (e.g., background contamination). The downstream analyst has a "scientific model" $P_X(X \mid \theta)$, where $\theta$ is the scientific estimand of interest (e.g., capturing the organism's patterns of gene expression). To the analyst, both $X$ and $Y$ are missing, because only $T(Y)$ is made available to the analyst. For example, $T(Y)$ could be a background corrected, normalized, or aggregated $Y$. The analyst's task is then to infer $\theta$ based on $T(Y)$ only.

Given such a setup, an obvious question is: what $T(Y)$ should the pre-processor produce/keep in order to ensure that the analyst's inference of $\theta$ will be as sharp as possible? If we ignore practical constraints, the answer seems to be rather trivial: choose $T(Y)$ to be a (minimal) sufficient statistic for
\[
P_Y(y \mid \theta, \xi) = \int P_Y(y \mid x; \xi)\, P_X(x \mid \theta)\, \mu(dx). \tag{45.7}
\]
But this does not address the real problem at all. There are thorny issues of dealing with the nuisance (to the analyst) parameter $\xi$, as well as the issue of computational feasibility and cost. But most critically, because of the separation of the phases, the scientific model $P_X(X \mid \theta)$, and hence the marginal model $P_Y(y \mid \theta, \xi)$ of (45.7), is typically unknown to the pre-processor.


At the very best, the pre-processor may have a working model $\tilde{P}_X(X \mid \eta)$, where $\eta$ may not even live on the same space as $\theta$. Consequently, the pre-processor may produce $T(Y)$ as a (minimal) sufficient statistic with respect to
\[
\tilde{P}_Y(y \mid \eta, \xi) = \int P_Y(y \mid x; \xi)\, \tilde{P}_X(x \mid \eta)\, \mu(dx). \tag{45.8}
\]
A natural question then is: what are sufficient and necessary conditions on the pre-processor's working model such that a $T(Y)$ that is (minimally) sufficient for (45.8) will also be (minimally) sufficient for (45.7)? Or, to use computer science jargon, when is $T(Y)$ a lossless compression (in terms of statistical efficiency)?

Evidently, we do not need the multi-phase framework to obtain trivial and useless answers such as setting $T(Y) = Y$ (which will be sufficient for any model of $Y$ only) or requiring the working model to be the same as the scientific model (which tells us nothing new). The multi-phase framework allows us to formulate and obtain theoretically insightful and practically relevant results that are unavailable in the single-phase framework. For example, in Blocker and Meng (2013), we obtained a non-trivial sufficient condition as well as a necessary condition (but they are not the same) for preserving sufficiency under a more general setting involving multiple (parallel) pre-processors during the pre-processing phase. The sufficient condition is in the same spirit as the condition for consistency of Rubin's variance rule under uncongeniality. That is, in essence, sufficiency under (45.8) implies sufficiency under (45.7) when the working model is more saturated than the scientific model. This is rather intuitive from a multi-phase perspective, because the fewer assumptions we make in earlier phases, the more flexibility the later phases inherit, and consequently, the better the chances these procedures preserve information or desirable properties.

There is, however, no free lunch. The more saturated our model is, the less compression it achieves by statistical sufficiency. Therefore, in order to make our results as practically relevant as possible, we must find ways to incorporate computational efficiency into our formulation. However, establishing a general theory for balancing statistical and computational efficiency is an extremely challenging problem. The central difficulty is well known: statistical efficiency is an inherent property of a procedure, but computational efficiency can vary tremendously across computational architectures and over time.

For necessary conditions, the challenge is of a different kind. Preserving sufficiency is a much weaker requirement than preserving a model, even for minimal sufficiency. For example, $\mathcal{N}(\mu, 1)$ and $\mathrm{Poisson}(\lambda)$ do not share even the same state space. However, the sample mean is a minimal sufficient statistic for both models. Therefore, a pre-processing model could be seriously flawed yet still lead to the best possible pre-processing (this could be viewed as a case of action consistency; see Section 45.5). This type of possibility makes building a multi-phase inference theory both intellectually demanding and intriguing.


In general, "What to keep?" or "Who will share what, with whom, when, and why?" are key questions for the communities in information and computer sciences, particularly in the areas of data curation and data provenance; see Borgman (2010) and Edwards et al. (2011). Data/digital curation, as defined by the US National Academies, is "the active management and enhancement of digital information assets for current and future use," and data provenance is "a record that describes the people, institutions, entities, and activities involved in producing, influencing, or delivering a piece of data or a thing" (Moreau et al., 2013). Whereas these fields are clearly critical for preserving data quality and understanding the data collection process for statistical modelling, currently there is little dialogue between these communities and statisticians despite shared interests. For statisticians to make meaningful contributions, we must go beyond the single-phase/single-model paradigm, because the fundamental problems these fields address involve, by default, multiple parties, who do not necessarily (or may not even be allowed to) share information, and yet they are expected to deliver scientifically useful data and digital information.

I believe the multi-phase inference framework will provide at least a relevant formulation to enter the conversation with researchers in these areas. Of course, there is a tremendous amount of foundation building to be done, even just to sort out which results in the single-phase framework are directly transferable and which are not. The three questions below again are just an appetizer:

(a) What are practically relevant theoretical criteria for judging the quality of pre-processing, without knowing how many types of analyses ultimately will be performed on the pre-processed data?

(b) What are key considerations and methods for formulating uncongeniality generally for multi-phase inference, for quantifying the degrees of uncongeniality, and for setting up a threshold for a tolerable degree?

(c) How do we quantify trade-offs between efficiencies that are designed for measuring different aspects of the multi-phase process, such as computational efficiency for pre-processing and statistical efficiency for analysis?

45.4 Multi-source inference

As students of statistics, we are all taught that a scientific way of collecting data from a population is to take a probabilistic sample. However, this was not the case a century ago. It took about half a century after its formal introduction in 1895 by Anders Nicolai Kiær (1838–1919), the founder of Statistics Norway, before probabilistic sampling became widely understood and accepted (see Bethlehem, 2009).


Most of us now can explain the idea intuitively by analogizing it with common practices, such as that only a tiny amount of blood is needed for any medical test (a fact for which we are all grateful). But it was difficult then for many — and even now for some — to believe that much can be learned about a population by studying only, say, a 5% random sample. Even harder was the idea that a 5% random sample is better than a 5% "quota sample," i.e., a sample purposefully chosen to mimic the population. (Very recently a politician dismissed an election poll as "non-scientific" because "it is random.")

Over the century, statisticians, social scientists, and others have amply demonstrated theoretically and empirically that (say) a 5% probabilistic/random sample is better than any 5% non-random sample in many measurable ways, e.g., bias, MSE, confidence coverage, predictive power, etc. However, we have not studied questions such as "Is an 80% non-random sample 'better' than a 5% random sample in measurable terms? 90%? 95%? 99%?"

This question was raised during a fascinating presentation by Dr. Jeremy Wu, then (in 2009) the Director of LED (Local Employment Dynamics), a pioneering program at the US Census Bureau. LED employed synthetic data to create an OnTheMap application that permits users to zoom into any local region in the US for various employee-employer paired information without violating the confidentiality of individuals or business entities. The synthetic data created for LED used more than 20 data sources in the LEHD (Longitudinal Employer-Household Dynamics) system. These sources vary from survey data such as a monthly survey of 60,000 households, which represent only .05% of US households, to administrative records such as unemployment insurance wage records, which cover more than 90% of the US workforce, to census data such as the quarterly census of earnings and wages, which includes about 98% of US jobs (Wu, 2012, and personal communication from Wu).

The administrative records such as those in LEHD are not collected for the purpose of statistical inference, but rather because of legal requirements, business practice, political considerations, etc. They tend to cover a large percentage of the population, and therefore they must contain useful information for inference. At the same time, they suffer from the worst kind of selection biases because they rely on self-reporting, convenient recording, and all sorts of other "sins of data collection" that we tell everyone to avoid.

But statisticians cannot avoid dealing with such complex combined data sets, because they are playing an increasingly vital role for official statistical systems and beyond. For example, the shared vision from a 2012 summit meeting between the government statistical agencies from Australia, Canada, New Zealand, the United Kingdom, and the US includes

"Blending together multiple available data sources (administrative and other records) with traditional surveys and censuses (using paper, internet, telephone, face-to-face interviewing) to create high quality, timely statistics that tell a coherent story of economic, social and environmental progress must become a major focus of central government statistical agencies." (Groves, February 2, 2012)


Multi-source inference therefore refers to situations where we need to draw inference by using data coming from different sources, some (but not all) of which were not collected for inference purposes. It is thus broader and more challenging than multi-frame inference, where multiple data sets are collected for inference purposes but with different survey frames; see Lohr and Rao (2006). Most of us would agree that the very foundation of statistical inference is built upon having a representative sample; even in notoriously difficult observational studies, we still try hard to create pseudo "representative" samples to reduce the impact of confounding variables. But the availability of a very large subpopulation, however biased, poses new opportunities as well as challenges.

45.4.1 Large absolute size or large relative size?

Let us consider a case where we have an administrative record covering $f_a$ percent of the population, and a simple random sample (SRS) from the same population which only covers $f_s$ percent, where $f_s \ll f_a$. Ideally, we want to combine the maximal amount of information from both of them to reach our inferential conclusions. But combining them effectively will depend critically on the relative information content in them, both in terms of how to weight them (directly or implied) and how to balance the gain in information with the increased analysis cost. Indeed, if the larger administrative dataset is found to be too biased relative to the cost of processing it, we may decide to ignore it. Wu's question therefore is a good starting point because it directly asks how the relative information changes as their relative sizes change: how large should $f_a/f_s$ be before an estimator from the administrative record dominates the corresponding one from the SRS, say in terms of MSE?

As an initial investigation, let us denote our finite population by $\{x_1,\ldots,x_N\}$. For the administrative record, we let $R_i = 1$ whenever $x_i$ is recorded and zero otherwise; and for the SRS, we let $I_i = 1$ if $x_i$ is sampled, and zero otherwise, where $i \in \{1,\ldots,N\}$. Here we assume $n_a = \sum_{i=1}^{N} R_i \gg n_s = \sum_{i=1}^{N} I_i$, and both are considered fixed in the calculations below. Our key interest here is to compare the MSEs of two estimators of the finite-population mean $\bar{X}_N$, namely,
\[
\bar{x}_a = \frac{1}{n_a}\sum_{i=1}^{N} x_i R_i \quad \text{and} \quad \bar{x}_s = \frac{1}{n_s}\sum_{i=1}^{N} x_i I_i.
\]
Recall that for finite-population calculations, all $x_i$'s are fixed, and all the randomness comes from the response/recording indicator $R_i$ for $\bar{x}_a$ and the sampling indicator $I_i$ for $\bar{x}_s$. Although the administrative record has no probabilistic mechanism imposed by the data collector, it is a common strategy to model the responding (or recording or reporting) behavior via a probabilistic model.


Here let us assume that a probit regression model is adequate to capture the responding behavior, which depends only on the individual's $x$ value. That is, we can express $R_i = 1(Z_i \le \alpha + \beta x_i)$, where the $Z_i$'s form an i.i.d. sample from $\mathcal{N}(0, 1)$. We could imagine $Z_i$ being, e.g., the $i$th individual's latent "refusal tendency," and when it is lower than a threshold that is linear in $x_i$, the individual responds. The intercept $\alpha$ allows us to model the overall percentage of respondents, with larger $\alpha$ implying more respondents. The slope $\beta$ models the strength of the self-selecting mechanism. In other words, as long as $\beta \neq 0$, we have a non-ignorable missing-data mechanism (Rubin, 1976).

Given that $\bar{x}_s$ is unbiased, its MSE is the same as its variance (Cochran, 2007), viz.
\[
\mathrm{var}(\bar{x}_s) = \frac{1 - f_s}{n_s} S_N^2(x), \quad \text{where} \quad S_N^2(x) = \frac{1}{N-1}\sum_{i=1}^{N}(x_i - \bar{x}_N)^2. \tag{45.9}
\]
The MSE of $\bar{x}_a$ is more complicated, mostly because $R_i$ depends on $x_i$. But under our assumption that $N$ is very large and $f_a = n_a/N$ stays (far) away from zero, the MSE is completely dominated by the squared bias term of $\bar{x}_a$, which itself is well approximated by, again because $N$ (and hence $n_a$) is very large,
\[
\mathrm{Bias}^2(\bar{x}_a) = \left\{\frac{\sum_{i=1}^{N}(x_i - \bar{x}_N)\, p(x_i)}{\sum_{i=1}^{N} p(x_i)}\right\}^2, \tag{45.10}
\]
where $p(x_i) = E(R_i \mid x_i) = \Phi(\alpha + \beta x_i)$, and $\Phi$ is the CDF for $\mathcal{N}(0, 1)$.

To get a sense of how this bias depends on $f_a$, let us assume that the finite population $\{x_1,\ldots,x_N\}$ itself can be viewed as an SRS of size $N$ from a superpopulation $X \sim \mathcal{N}(\mu, \sigma^2)$. By the Law of Large Numbers, the bias term in (45.10) is essentially the same as (again because $N$ is very large)
\[
\frac{\mathrm{cov}\{X, p(X)\}}{E\{p(X)\}} = \frac{\sigma E\{Z\,\Phi(\tilde{\alpha} + \tilde{\beta}Z)\}}{E\{\Phi(\tilde{\alpha} + \tilde{\beta}Z)\}} = \frac{\tilde{\beta}}{\sqrt{1+\tilde{\beta}^2}}\; \frac{\sigma\,\phi\!\left(\tilde{\alpha}/\sqrt{1+\tilde{\beta}^2}\right)}{\Phi\!\left(\tilde{\alpha}/\sqrt{1+\tilde{\beta}^2}\right)}, \tag{45.11}
\]
where $\tilde{\alpha} = \alpha + \beta\mu$, $\tilde{\beta} = \sigma\beta$, $Z \sim \mathcal{N}(0, 1)$, and $\phi$ is its density function. Integration by parts and properties of Normals are used for arriving at (45.11).

An insight is provided by (45.11) when we note that $\Phi\{\tilde{\alpha}/(1+\tilde{\beta}^2)^{1/2}\}$ is well estimated by $f_a$ because $N$ is large, and hence $\tilde{\alpha}/(1+\tilde{\beta}^2)^{1/2} \approx \Phi^{-1}(f_a) = z_{f_a}$, where $z_q$ is the $q$th quantile of $\mathcal{N}(0, 1)$. Consequently, we have from (45.11),
\[
\frac{\mathrm{MSE}(\bar{x}_a)}{\sigma^2} \approx \frac{\mathrm{Bias}^2(\bar{x}_a)}{\sigma^2} = \frac{\tilde{\beta}^2\,\phi^2(z_{f_a})}{(1+\tilde{\beta}^2)\, f_a^2} = \frac{\tilde{\beta}^2\, e^{-z_{f_a}^2}}{(1+\tilde{\beta}^2)\, 2\pi f_a^2}, \tag{45.12}
\]
which will be compared to (45.9) after replacing $S_N^2(X)$ by $\sigma^2$.


That is,
\[
\frac{\mathrm{MSE}(\bar{x}_s)}{\sigma^2} = \frac{1}{n_s} - \frac{1}{N} \approx \frac{1}{n_s}, \tag{45.13}
\]
where $1/N$ is ignored for the same reason that $\mathrm{var}(\bar{x}_a) = O(N^{-1})$ is ignored.

It is worth pointing out that the seemingly mismatched units in comparing (45.12), which uses the relative size $f_a$, with (45.13), which uses the absolute size $n_s$, reflect the different natures of non-sampling and sampling errors. The former can be made arbitrarily small only when the relative size $f_a$ is made arbitrarily large, that is $f_a \to 1$; just making the absolute size $n_a$ large will not do the trick. In contrast, as is well known, we can make (45.13) arbitrarily small by making the absolute size $n_s$ arbitrarily large even if $f_s \to 0$ when $N \to \infty$. Indeed, for most public-use data sets, $f_s$ is practically zero. For example, with respect to the US population, an $f_s = .01\%$ would still render $n_s$ more than 30,000, large enough for controlling sampling errors for many practical purposes. Indeed, (45.13) will be no greater than .000033. In contrast, if we were to use an administrative record of the same size, i.e., if $f_a = .01\%$, then (45.12) will be greater than 3.13, almost 100,000 times (45.13), if $\tilde{\beta} = .5$.

However, if $f_a = 95\%$, $z_{f_a} = 1.645$, (45.12) will be .00236, for the same $\tilde{\beta} = .5$. This implies that as long as $n_s$ does not exceed about 420, the estimator from the biased sample will have a smaller MSE (assuming, of course, $N \gg 420$). The threshold value for $n_s$ will drop to about 105 if we increase $\tilde{\beta}$ to 2, but will increase substantially to about 8,570 if we drop $\tilde{\beta}$ to .1. We must be mindful, however, that these comparisons assume the SRS, and more generally the survey data, have been collected perfectly, which will not be the case in reality because of both non-response and response biases; see Liu et al. (2013). Hence in reality it would take a smaller $f_a$ to dominate the probabilistic sample with $f_s$ sampling fraction, precisely because the latter has been contaminated by non-probabilistic selection errors as well. Nevertheless, a key message here is that, as far as statistical inference goes, what makes a "Big Data" set big is typically not its absolute size, but its relative size to its population.
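The back-of-the-envelope numbers above are easy to reproduce; the sketch below (my own illustration, not from the chapter) simply evaluates the approximations (45.12) and (45.13). With $f_a = 95\%$ and $\tilde{\beta} = .5$ it returns roughly .00236, hence the break-even SRS size of roughly 420 quoted above.

```python
import numpy as np
from scipy.stats import norm

def mse_ratio_admin(f_a, beta_tilde):
    """MSE(x_bar_a)/sigma^2 from (45.12): the squared bias of the self-selected sample."""
    z = norm.ppf(f_a)
    return beta_tilde**2 / (1 + beta_tilde**2) * np.exp(-z**2) / (2 * np.pi * f_a**2)

def mse_ratio_srs(n_s):
    """MSE(x_bar_s)/sigma^2 from (45.13), ignoring the 1/N term."""
    return 1.0 / n_s

print(mse_ratio_admin(0.0001, 0.5))   # ~3.13, vs. ~.000033 for an SRS of the same size
for f_a, b in [(0.95, 0.5), (0.95, 2.0), (0.95, 0.1)]:
    r = mse_ratio_admin(f_a, b)
    print(f"f_a = {f_a}, beta~ = {b}: MSE(x_bar_a)/sigma^2 = {r:.5f}, "
          f"smaller than an SRS's 1/n_s whenever n_s < {1 / r:,.0f}")
```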


45.4.2 Data defect index

The sensitivity of our comparisons above to $\tilde{\beta}$ is expected because it governs the self-reporting mechanism. In general, whereas closed-form expressions such as (45.12) are hard to come by, the general expression in (45.10) leads to
\[
\frac{\mathrm{Bias}^2(\bar{x}_a)}{S_N^2(x)} = \rho_N^2(x, p)\left\{\frac{S_N(p)}{\bar{p}_N}\right\}^2\left(\frac{N-1}{N}\right)^2 \le \rho_N^2(x, p)\,\frac{1 - \bar{p}_N}{\bar{p}_N}\left(\frac{N-1}{N}\right), \tag{45.14}
\]
where $\bar{p}_N$ and $S_N^2(p)$ are the finite-population mean and variance of the $p(x_i)$'s, and $\rho_N(x, p)$ is the finite-population correlation between $x_i$ and $p(x_i)$.

The (middle) re-expression of the bias given in (45.14) in terms of the correlation between the sampling variable $x$ and the sampling/response probability $p$ is a standard strategy in the survey literature; see Hartley and Ross (1954) and Meng (1993). Although mathematically trivial, it provides a greater statistical insight, i.e., the sample mean from an arbitrary sample is an unbiased estimator for the target population mean if and only if the sampling variable $x$ and the data collection mechanism $p(x)$ are uncorrelated. In this sense we can view $\rho_N(x, p)$ as a "defect index" for estimation (using the sample mean) due to the defect in data collection/recording. This result says that we can reduce the estimation bias of the sample mean for non-equal probability samples or even non-probability samples as long as we can reduce the magnitude of the correlation between $x$ and $p(x)$. This possibility provides an entryway into dealing with a large but biased sample, and exploiting it may require less knowledge about $p(x)$ than required for other bias reduction techniques such as (inverse probability) weighting, as in the Horvitz-Thompson estimator.

The (right-most) inequality in (45.14) is due to the fact that for any random variable satisfying $U \in [0, 1]$, $\mathrm{var}(U) \le E(U)\{1 - E(U)\}$. This bound allows us to control the bias using only the proportion $\bar{p}_N$, which is well estimated by the observed sample fraction $f_a$. It says that we can also control the bias by letting $f_a$ approach one. In the traditional probabilistic sampling context, this observation would only induce a "duhhh" response, but in the context of multi-source inference it is actually a key reason why an administrative record can be very useful despite being a non-probabilistic sample.

Caution is much needed, however, because (45.14) also indicates that it is not easy at all to use a large $f_a$ to control the bias (and hence MSE). By comparing (45.13) and the bound in (45.14) we will need (as a sufficient condition)
\[
f_a > \frac{n_s\, \rho_N^2(x, p)}{1 + n_s\, \rho_N^2(x, p)}
\]
in order to guarantee $\mathrm{MSE}(\bar{x}_a) < \mathrm{MSE}(\bar{x}_s)$. For example, even if $n_s = 100$, we would need over 96% of the population if $\rho_N = .5$. This reconfirms the power of probabilistic sampling and reminds us of the danger in blindly trusting that "Big Data" must give us better answers. On the other hand, if $\rho_N = .1$, then we will need only 50% of the population to beat an SRS with $n_s = 100$. If $n_s = 100$ seems too small in practice, the same $\rho_N = .1$ also implies that a 96% subpopulation will beat an SRS as large as $n_s = \rho_N^{-2}\{f_a/(1 - f_a)\} = 2400$, which is no longer a practically irrelevant sample size.

Of course all these calculations depend critically on knowing the value of $\rho_N$, which cannot be estimated from the biased sample itself. However, recall that for multi-source inference we will also have at least a (small) probabilistic sample. The availability of both small random sample(s) and large non-random sample(s) opens up many possibilities.
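For a known or hypothesized $\rho_N$, the sufficient condition above, and the reverse calculation of which SRS a given biased sample can match, are trivial to compute; the sketch below (my own illustration, not from the chapter) reproduces the 96%, 50%, and 2,400 figures just quoted.

```python
def min_fa_to_beat_srs(n_s, rho):
    """Sufficient coverage f_a, from the bound in (45.14), for the biased sample mean
    to have a smaller MSE than the mean of an SRS of size n_s."""
    return n_s * rho**2 / (1 + n_s * rho**2)

def srs_size_matched(f_a, rho):
    """SRS size n_s whose MSE is matched by a biased sample covering a fraction f_a."""
    return f_a / ((1 - f_a) * rho**2)

print(min_fa_to_beat_srs(100, 0.5))   # ~0.962: need over 96% of the population
print(min_fa_to_beat_srs(100, 0.1))   # 0.5: 50% coverage suffices when rho_N = .1
print(srs_size_matched(0.96, 0.1))    # 2400: a 96% subpopulation beats an SRS of size 2,400
```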


Of course, all these calculations depend critically on knowing the value of $\rho_N$, which cannot be estimated from the biased sample itself. However, recall that for multi-source inference we will also have at least a (small) probabilistic sample. The availability of both small random sample(s) and large non-random sample(s) opens up many possibilities. The following (non-random) sample of questions touches on this and other issues for multi-source inference:

(a) Given partial knowledge of the recording/response mechanism for a (large) biased sample, what is the optimal way to create an intentionally biased sub-sampling scheme to counter-balance the original bias, so that the resulting sub-sample is guaranteed to be less biased than the original biased sample in terms of the sample mean, or other estimators, or predictive power?

(b) What should be the key considerations when combining small random samples with large non-random samples, and what are sensible "corner-cutting" guidelines when facing resource constraints? How can the combined data help to estimate $\rho_N(x,p)$? In what ways can such estimators aid multi-source inference?

(c) What are theoretically sound and practically useful defect indices for prediction, hypothesis testing, model checking, clustering, classification, etc., as counterparts to the defect index for estimation, $\rho_N(x,p)$? What are their roles in determining information bounds for multi-source inference? What are the relevant information measures for multi-source inference?

45.5 The ultimate prize or price

Although we have discussed the trio of inference problems separately, many real-life problems involve all of them. For example, the aforementioned OnTheMap application has many resolution levels (because of arbitrary zoom-in), many sources of data (more than 20 sources), and many phases of pre-processing (even God would have trouble keeping track of all the processing that these twenty-some survey, census, and administrative data sets have endured!), including the entire process of producing the synthetic data themselves. Personalized medicine is another class of problems where one typically encounters all three types of complications. Besides the obvious resolution issue, typically the data need to go through pre-processing in order to protect the confidentiality of individual patients (beyond just removing the patient's name). Yet individual-level information is most useful. To increase the information content, we often supplement clinical trial data with observational data, for example, on side effects when the medications were used for another disease.

To bring the message home, it is a useful exercise to imagine ourselves in a situation where our statistical analysis would actually be used to decide the best treatment for a serious disease for a loved one or even for ourselves. Such a "personalized situation" emphasizes that it is my interest/life at stake, which should encourage us to think more critically and creatively, not just to publish another paper or receive another prize. Rather, it is about getting to the bottom of what we do as statisticians — to transform whatever empirical observations we have into the best possible quantitative evidence for scientific understanding and decision making and, more generally, to advance science, society, and civilization. That is our ultimate prize.


However, when we inappropriately formulate our inference problems for mental, mathematical, or computational convenience, the chances are that someone or, in the worst case, our entire society will pay the ultimate price. We statisticians are quick to seize upon the 2008 world-wide financial crisis as an ultimate example demonstrating how a lack of understanding and proper accounting for uncertainties and correlations leads to catastrophe. Whereas this is an extreme case, it is unfortunately not an unnecessary worry that if we continue to teach our students to think only in a single-resolution, single-phase, single-source framework, then there is only a single outcome: they will not be at the forefront of quantitative inference. When the world is full of problems with complexities far exceeding what can be captured by our theoretical framework, our reputation for critical thinking about the entirety of the inference process, from data collection to scientific decision, cannot stand.

The "personalized situation" also highlights another aspect that our current teaching does not emphasize enough. If you really had to face the unfortunate I-need-treatment-now scenario, I am sure your mind would not be (merely) on whether the methods you used are unbiased or consistent. Rather, the type of questions you may/should be concerned with are (1) "Would I reach a different conclusion if I use another analysis method?" or (2) "Have I really done the best given my data and resource constraints?" or (3) "Would my conclusion change if I were given all the original data?"

Questions (1) and (2) remind us to put more emphasis on relative optimality. Whereas it is impossible to understand all biases or inconsistencies in messy and complex data, knowledge which is needed to decide on the optimal method, we still can and should compare methods relative to each other, as well as relative to the resources available (e.g., time, energy, funding). Equally important, all three questions highlight the need to study qualitative consistency or action consistency much more than quantitative consistency (e.g., the numerical value of our estimator reaching the exact truth in the limit). Our methods, data sets, and numerical results can all be rather different (e.g., a p-value of .2 versus .8), yet their resulting decisions and actions can still be identical, because typically there are only two (yes and no) or at most a handful of choices.

It is this "low resolution" of our action space in real life which provides the flexibility for us to accept quantitative inconsistency caused by defects such as resolution discrepancy, uncongeniality, or selection bias, yet still reach scientifically useful inference. It permits us to move beyond single-phase, single-source, or single-resolution frameworks, but still be able to obtain theoretically elegant and practically relevant results in the same spirit as those NP-worthy findings in many other fields. I therefore very much hope you will join me for this intellectually exciting and practically rewarding research journey, unless, of course, you are completely devoted to fundraising to establish an NP in statistics.


Acknowledgements

The material on multi-resolution inference benefitted greatly from critical comments by Alex Blocker and Keli Liu, both of whom also provided many insightful comments throughout, as did David Jones. The joint work with Alex Blocker and Xianchao Xie (cited in the reference list) shaped the formulation of the multi-phase inference, which was greatly encouraged by Christine Borgman, who also taught me, together with Alyssa Goodman, Paul Groth, and Margaret Hedstrom, data curation and data provenance. Dr. Jeremy Wu inspired and encouraged me to formulate the multi-source inference, and provided extremely helpful information and insights regarding the LED/LEHD program. Keli Liu also provided invaluable editing and proofreading, as did Steven Finch. "Good stuff!" coming from my academic twin brother Andrew Gelman was all the encouragement I needed to squeeze out every possible minute between continental breakfasts and salmon/chicken dinners. I give them 100% thanks, but 0% liability for any naïveté, wishful thinking, and sign of lack of sleep — this has been the most stressful paper I have ever written. I also thank the NSF for partial financial support, and the Editors, especially Xihong Lin and Geert Molenberghs, for help and extraordinary patience.

References

Arlot, S. and Celisse, A. (2010). A survey of cross-validation procedures for model selection. Statistics Surveys, 4:40–79.

Bethlehem, J. (2009). The Rise of Survey Sampling. CBS Discussion Paper No. 9015.

Blitzstein, J. and Meng, X.-L. (2010). Nano-project qualifying exam process: An intensified dialogue between students and faculty. The American Statistician, 64:282–290.

Blocker, A.W. and Meng, X.-L. (2013). The potential and perils of preprocessing: Building new foundations. Bernoulli, 19:1176–1211.

Borgman, C.L. (2010). Research data: Who will share what, with whom, when, and why? China-North America Library Conference, Beijing, People's Republic of China.

Bouman, P., Dukić, V., and Meng, X.-L. (2005). A Bayesian multiresolution hazard model with application to an AIDS reporting delay study. Statistica Sinica, 15:325–357.


Bouman, P., Meng, X.-L., Dignam, J., and Dukić, V. (2007). A multiresolution hazard model for multicenter survival studies: Application to tamoxifen treatment in early stage breast cancer. Journal of the American Statistical Association, 102:1145–1157.

Cochran, W.G. (2007). Sampling Techniques. Wiley, New York.

Daubechies, I. (1992). Ten Lectures on Wavelets. SIAM.

Donoho, D.L. and Elad, M. (2003). Optimally sparse representation in general (nonorthogonal) dictionaries via l1 minimization. Proceedings of the National Academy of Sciences, 100:2197–2202.

Donoho, D.L. and Johnstone, I.M. (1994). Ideal spatial adaptation by wavelet shrinkage. Biometrika, 81:425–455.

Donoho, D.L., Johnstone, I.M., Kerkyacharian, G., and Picard, D. (1995). Wavelet shrinkage: Asymptopia? (with discussion). Journal of the Royal Statistical Society, Series B, 57:301–369.

Edwards, P.N., Mayernik, M.S., Batcheller, A.L., Bowker, G.C., and Borgman, C.L. (2011). Science friction: Data, metadata, and collaboration. Social Studies of Science, 41:667–690.

Fay, R.E. (1992). When are inferences from multiple imputation valid? Proceedings of the Survey Research Methods Section, American Statistical Association, Washington, DC, pp. 227–232.

Garey, M. (1974). Optimal binary search trees with restricted maximal depth. SIAM Journal on Computing, 3:101–110.

Gelman, A., Carlin, J.B., Stern, H.S., and Rubin, D.B. (2003). Bayesian Data Analysis. Chapman & Hall, London.

Gelman, A. and Little, T.C. (1997). Poststratification into many categories using hierarchical logistic regression. Survey Methodology, 23:127–135.

George, E.I. and McCulloch, R.E. (1997). Approaches for Bayesian variable selection. Statistica Sinica, 7:339–373.

Groves, R.M. (February 2, 2012). National statistical offices: Independent, identical, simultaneous actions thousands of miles apart. US Census Bureau Director's Blog, http://blogs.census.gov/directorsblog/.

Hartley, H. and Ross, A. (1954). Unbiased ratio estimators. Nature, 174:270–271.

Hirakawa, K. and Meng, X.-L. (2006). An empirical Bayes EM-wavelet unification for simultaneous denoising, interpolation, and/or demosaicing. In Image Processing, 2006 IEEE International Conference on. IEEE, pp. 1453–1456.


Knuth, D. (1997). The Art of Computer Programming, Vol. 1: Fundamental Algorithms, 3rd edition. Addison-Wesley, Reading, MA.

Kott, P.S. (1995). A paradox of multiple imputation. Proceedings of the Survey Research Methods Section, American Statistical Association, Washington, DC, pp. 380–383.

Lax, J.R. and Phillips, J.H. (2009). How should we estimate public opinion in the states? American Journal of Political Science, 53:107–121.

Lee, T.C. and Meng, X.-L. (2005). A self-consistent wavelet method for denoising images with missing pixels. In Proceedings of the 30th IEEE International Conference on Acoustics, Speech, and Signal Processing, 2:41–44.

Lin, D., Foster, D.P., and Ungar, L.H. (2010). A Risk Ratio Comparison of l0 and l1 Penalized Regressions. Technical Report, University of Pennsylvania, Philadelphia, PA.

Liu, J., Meng, X.-L., Chen, C.-N., and Alegría, M. (2013). Statistics can lie but can also correct for lies: Reducing response bias in NLAAS via Bayesian imputation. Statistics and Its Interface, 6:387–398.

Lohr, S. and Rao, J.N.K. (2006). Estimation in multiple-frame surveys. Journal of the American Statistical Association, 101:1019–1030.

Meng, X.-L. (1993). On the absolute bias ratio of ratio estimators. Statistics & Probability Letters, 18:345–348.

Meng, X.-L. (1994). Multiple-imputation inferences with uncongenial sources of input (with discussion). Statistical Science, 9:538–558.

Meng, X.-L. (2009a). Automated bias-variance trade-off: Intuitive inadmissibility or inadmissible intuition? In Frontiers of Statistical Decision Making and Bayesian Analysis (M.H. Chen, D.K. Dey, P. Mueller, D. Sun, and K. Ye, Eds.). Springer, New York, pp. 95–112.

Meng, X.-L. (2009b). Desired and feared — What do we do now and over the next 50 years? The American Statistician, 63:202–210.

Meyer, Y. (1993). Wavelets: Algorithms and Applications. SIAM.

Mitchell, T.J. and Beauchamp, J.J. (1988). Bayesian variable selection in linear regression. Journal of the American Statistical Association, 83:1023–1032.

Moreau, L., Belhajjame, K., B'Far, R., Cheney, J., Coppens, S., Cresswell, S., Gil, Y., Groth, P., Klyne, G., Lebo, T., McCusker, J., Miles, S., Myers, J., Sahoo, S., and Tilmes, C., Eds. (2013). PROV-DM: The PROV Data Model. Technical Report, World Wide Web Consortium.


Nason, G.P. (2002). Choice of wavelet smoothness, primary resolution and threshold in wavelet shrinkage. Statistics and Computing, 12:219–227.

Poggio, T. and Girosi, F. (1998). A sparse representation for function approximation. Neural Computation, 10:1445–1454.

Rao, J.N.K. (2005). Small Area Estimation. Wiley, New York.

Rubin, D.B. (1976). Inference and missing data. Biometrika, 63:581–592.

Rubin, D.B. (1987). Multiple Imputation for Nonresponse in Surveys. Wiley, New York.

Rubin, D.B. (1996). Multiple imputation after 18+ years. Journal of the American Statistical Association, 91:473–489.

Rubin, D.B. (2005). Causal inference using potential outcomes. Journal of the American Statistical Association, 100:322–331.

Wu, J. (2012). 21st century statistical systems. Blog: Not Random Thought, August 1, 2012. Available at http://jeremyswu.blogspot.com/.

Xie, X. and Meng, X.-L. (2013). Dissecting multiple imputation from a multi-phase inference perspective: What happens when there are three uncongenial models involved? The Annals of Statistics, under review.

Yang, J., Peng, Y., Xu, W., and Dai, Q. (2009). Ways to sparse representation: An overview. Science in China Series F: Information Sciences, 52:695–703.


Part V

Advice for the next generation


46 Inspiration, aspiration, ambition

C.F. Jeff Wu
School of Industrial and Systems Engineering
Georgia Institute of Technology, Atlanta, GA

46.1 Searching the source of motivation

One can describe the motivation or drive for accomplishments or scholarship at three levels: inspiration, aspiration, and ambition. They represent different (but not necessarily exclusive) mindsets or modi operandi. Let me start with the Merriam–Webster Dictionary definitions of the three words.

(a) Inspiration is "the action or power of moving the intellect or emotions." In its religious origin, inspiration can be described as "a divine influence or action... to qualify him/her to receive and communicate sacred revelation." It works at the spiritual level even in describing work or career.

(b) Aspiration is "a strong desire to achieve something high or great." It has a more concrete aim than inspiration but still retains an idealistic element.

(c) Ambition is "the desire to achieve a particular end" or "an ardent desire for rank, fame, or power." It has a utilitarian connotation and is the most practical of the three. Ambition can be good when it drives us to excel, but it can also have a negative effect. Aspiration, being between the two, is more difficult to delineate.

Before I go on, I would like to bring your attention to a convocation speech (Wu, 2008) entitled "Idealism or pragmatism" that I gave in 2008 at the University of Waterloo. This speech is reproduced in the Appendix. Why or how is this related to the main theme of this chapter? Idealism and pragmatism are two ideologies we often use to describe how we approach life or work. They represent different mindsets but are not mutually exclusive. Inspiration is clearly idealistic, ambition has a pragmatic purpose, and aspiration can be found in both. The speech can be taken as a companion piece to this chapter.


46.2 Examples of inspiration, aspiration, and ambition

To see how inspiration, aspiration, and ambition work, I will use examples in the statistical world for illustration. Jerzy Neyman is an embodiment of all three. Invention of the Neyman–Pearson theory and confidence intervals is clearly inspirational. Neyman's success in defending the theory from criticism by contemporaries like Sir Ronald A. Fisher was clearly an act of aspiration. His establishment of the Berkeley Statistics Department as a leading institution of learning in statistics required ambition in addition to aspiration.

The personality of the individual often determines at what level(s) he/she operates. Charles Stein is a notable example of inspiration, as evidenced by his pioneering work in Stein estimation, Stein–Chen theory, etc. But he did not possess the necessary attribute to push for his theory. It is the sheer originality and potential impact of his theoretical work that helped his contributions make their way to wide acceptance and much acclaim.

Another example of inspiration, which is more technical in nature, is the Cooley–Tukey algorithm for the Fast Fourier Transform (FFT); see Cooley and Tukey (1965). The FFT has seen many applications in engineering, science, and mathematics. Less known to the statistical world is that the core technical idea in Tukey's development of the algorithm came from a totally unrelated field. It employed Yates' algorithm (Yates, 1937) for computing factorial effects in two-level factorial designs (a small sketch of the recursion is given at the end of this section).

In Yates' time, computing was very slow and therefore he saw the need to find a fast algorithm (in fact, optimal for the given problem) to ease the burden on mechanical calculators. About thirty years later, Tukey still felt the need to develop a fast algorithm in order to compute the discrete Fourier transform over many frequency values. Even though the stated problems are totally different, their needs for a faster algorithm (relative to the technology of their respective times) were similar. By some coincidence, Yates' early work lent a good hand to the later development of the FFT.

As students of the history of science, we can learn from this example. If work has structural elegance and depth, it may find good and unexpected applications years later. One cannot and should not expect instant gratification from the work. Alas, this may come too late for the ambitious.

Examples of ambition without inspiration abound in the history of science. Even some of the masters in statistics could not stay above it. Here are two examples. In testing statistical independence in an r × c contingency table, Karl Pearson used rc − 1 as the degrees of freedom. Fisher showed in 1922 that, when the marginal proportions are estimated, the correct degrees of freedom should be (r − 1)(c − 1). Pearson did not react kindly. He said in the same year, "Such a view is entirely erroneous. [...] I trust my critic will pardon me for comparing him with Don Quixote tilting at the windmill" (Pearson, 1922, p. 191). Fisher's retort came much later. In a 1950 volume of his collected works, he wrote of Pearson: "If peevish intolerance of free opinion in others is a sign of senility, it is one which he had developed at an early age" (Fisher, 1950). Even the greatest ever statistician could not be more magnanimous.
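As promised above, here is a minimal sketch of Yates' recursion for a 2^k factorial, added for illustration and not part of the original chapter. The repeated halving into pairwise sums and differences is essentially the same divide-and-conquer structure that, combined with complex twiddle factors, underlies the Cooley–Tukey FFT.

import numpy as np

def yates(y):
    """Yates' algorithm for a 2^k factorial laid out in standard order,
    e.g., (1), a, b, ab, c, ac, bc, abc for k = 3.  Each pass replaces the
    column by pairwise sums followed by pairwise differences; after k
    passes the entries are the contrast totals (grand total first).  For
    an unreplicated design, dividing the remaining entries by 2^(k-1)
    gives the usual effect estimates.  The cost is O(k 2^k) rather than
    the naive O(4^k)."""
    y = np.asarray(y, dtype=float)
    k = int(np.log2(len(y)))
    for _ in range(k):
        pairs = y.reshape(-1, 2)
        y = np.concatenate([pairs.sum(axis=1), pairs[:, 1] - pairs[:, 0]])
    return y

# A 2^2 example with responses for (1), a, b, ab:
# yields total 20, A contrast 6, B contrast 4, AB contrast 2.
print(yates([3.0, 5.0, 4.0, 8.0]))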


46.3 Looking to the future

In 2010, I gave a speech (Wu, 2010) whose main motivation was the discomforting trend I have witnessed in the last 15–20 years in the US and elsewhere. Compared to back when I started my career, there has been an increasing emphasis on the number of papers, journal rankings, citations, and funding. Back then, a new PhD could secure a tenure-track post in a top department with no paper published or accepted, as long as the letters were good and the work in the thesis was considered to have good quality. Not anymore. We now see most leading candidates in the applicant pool having several papers in top journals (and who ranks these journals?). Is this due to inflation, or is the new generation really smarter or harder working than mine? Admittedly, most of the top departments still judge candidates by the merits of their work. But the new and unhealthy emphasis has affected the community by and large.

There are some obvious culprits, mostly due to the environment we are in. The funding agencies give preference to large team projects, which require a large number of papers, patents, etc. The widespread use of internet tools such as the Science Citation Index (SCI) and Google Scholar has led to instant comparisons and rankings of researchers. Unfortunately, this obsession with numerics has led to several widely used rankings of universities in the world. In many countries (the US being one lucky exception), university administrators pressure researchers to go for more citations in order to boost their ranking.

In the statistical world, some countries list the "Big Four" (i.e., The Annals of Statistics, Biometrika, the Journal of the American Statistical Association, and the Journal of the Royal Statistical Society, Series B) as the most desirable journals for promotion and awards. The detrimental impact on the development of long-lasting work is obvious, but young researchers can't afford to work or think long term. Immediate survival is their primary concern.

What can be done to mitigate this negative effect? I am not optimistic about the environment that spawned this trend. The widespread use of the internet can only exacerbate the trend. I hope that the scientific establishment and policy makers of countries that aspire to join the league of scientific powers will soon realize that sheer numbers of papers or citations alone do not lead to major advances and discoveries. They should modify their reward systems accordingly.

The leading academic departments, being good practitioners, bear a great responsibility in convincing the community not to use superficial numeric measures. At the individual level, good education at an early stage can help.


Professors should serve as role models and advise students to go for quality over quantity. This theme should be featured visibly in conferences or sessions for new researchers. My final advice for aspiring young researchers is to always look inward to your inspiration, aspiration, and ambition to plan and assess your work and career.

Appendix: Idealism or pragmatism (Wu, 2008)

Let me join my colleague Professor Rao in thanking the University, the Chancellor, and the President for bestowing such an honor upon us, and in congratulating this year's graduating class for their hard work and achievements. Since I am younger, Jon said that I should stand here longer [laugh].

I would like to share some thoughts on our responsibilities to society and to ourselves. When I was a bit younger than you are now, I faced two distinct choices: medicine and pure mathematics. This was Taiwan in the 1960s, and most of my relatives urged my parents to nudge me toward medicine, not because they thought I would make a good doctor, but because it would provide a secure career and high income. Thanks to my parents, though, I was able to follow my passion and pursue mathematics. At that time I did not consider the financial consequences because I enjoyed doing math. I am not here to suggest that you follow this romanticism in your career planning — in fact many of you have probably lined up some good jobs already [laugh]. Rather, I want to discuss the role of idealism and pragmatism in our lives.

At the many turning points of our lives, we are often faced with choosing one or the other. Most will heed the call of pragmatism and shun idealism. For example, some of us may find that we disagree with a policy or decision at work. Yet it will be our job to follow or implement this policy. A pragmatist would not go out of his way to show disapproval, even if this policy goes against his conscience. On the other hand, an idealist in this situation is likely to show her disapproval, even if it puts her livelihood at risk.

One of the most shining examples of idealism, of course, is Nelson Mandela, who fought for freedom in South Africa. Apartheid was designed to intimidate minorities into submission. Even something as simple as membership in a legal political organization could lead to consequences such as loss of income and personal freedom. Knowing these risks fully well, Mandela and countless others embarked on that freedom struggle, which lasted for decades.

While I do not expect or suggest that many can follow the most idealistic route, pragmatism and idealism are not incompatible. For example, researchers can channel their efforts into finding new green energy solutions. Even humble statisticians like us can help these environmental researchers design more efficient experiments.


Successful business people, which many of you will become, can pay more attention to corporate social responsibility, and not focus exclusively on the bottom line.

Perhaps it is naive, but I truly believe that most often we can strike a balance between what is good for others and what is good for us. If we can keep this spirit and practice it, the world will be a much better and more beautiful place!

Thank you for your attention, and congratulations to you once again.

References

Cooley, J.W. and Tukey, J.W. (1965). An algorithm for the machine calculation of complex Fourier series. Mathematics of Computation, 19:297–301.

Fisher, R.A. (1950). Contributions to Mathematical Statistics. Chapman & Hall, London.

Pearson, K. (1922). On the χ² test of goodness of fit. Biometrika, 14:186–191.

Wu, C.F.J. (2008). Convocation speech delivered on June 13, 2008 at the graduation ceremony of the Faculty of Mathematics, University of Waterloo, Ontario, Canada.

Wu, C.F.J. (2010). Plenary talk given on October 22, 2010 during the triennial Chinese Conference in Probability and Statistics held at Nankai University, Tianjin, People's Republic of China.

Yates, F. (1937). The Design and Analysis of Factorial Experiments. Imperial Bureau of Soil Sciences, Tech. Comm. No. 35.


47 Personal reflections on the COPSS Presidents' Award

Raymond J. Carroll
Department of Statistics
Texas A&M University, College Station, TX

47.1 The facts of the award

I received the COPSS Presidents' Award in 1988, one year after Jeff Wu and one year before Peter Hall. I was the eighth recipient.

I had just moved to Texas A&M in fall 1987, and did not know that Cliff Spiegelman, also still at Texas A&M, had nominated me. I remember very clearly being told about this honor. I was working at home, and out of the blue I received a phone call from the head of the Presidents' Award Committee, about a month before the Joint Statistical Meetings in New Orleans, and he asked, roughly, "Are you going to the JSM in New Orleans?" I actually had not planned on it, and he told me I probably should, since I had won the Presidents' Award. I remember being very happy, and I know I took the rest of the day off and just floated around.

I am not by nature a very reflective person, preferring instead to look ahead and get on with the next project. However, the invitation to write for the COPSS 50th Anniversary Book Project motivated me to reflect a bit on what I had done prior to 1988, and on whether there were any morals to the story that I could share.

47.2 Persistence

Persistence I have. My first six submitted papers were rejected, and some in a not very nice way. Having rejected two of my papers, the then-Editor of The Annals of Statistics wrote to tell me that he thought I had no possibility of a successful career in academics, and that I would be better off going into industry; at the time, this was a grand insult from an academic.


The only effect that had on me was that I worked harder, and I swore that one day that # # &# Editor would invite me to give a talk at his university, which finally happened in 1990. By then, I had calmed down.

Over the years, editors have become less judgmental, but it is easy to forget how devastating it can be for a new PhD to have his/her thesis paper rejected. I have seen very talented students who left academia after their thesis paper had its initial rejection. As the Editor of the "Theory and Methods" Section of the Journal of the American Statistical Association (JASA) and then later of Biometrics, I was always on the lookout for new PhD's, and would work to get their papers published, even in very abbreviated form.

Unfortunately, nowadays it is routine in some sub-areas of statistics to see applicants for an initial appointment at the Assistant Professor level with approximately six papers in top journals, without a postdoc. I remain skeptical that this is really their work. In any case, this makes it harder to be generous, and it has led to a bit of a coarsening effect in the review of the first paper of many new PhD's.

47.3 Luck: Have a wonderful Associate Editor

We all know that many Associate Editors are merely mailboxes, but by no means all. My experience is that the quality of Associate Editors is a stationary process. I wrote a paper in 1976 or so for The Annals of Statistics which had a rather naive proof about expansions of what were then called M-estimators (I guess they still are). The Associate Editor, who knew I was a new Assistant Professor and who I later found out was Willem van Zwet, to whom I remain grateful, wrote a review that said, in effect, "Nice result, it is correct, but too long; here is a two-page proof." The paper appeared (Carroll, 1978), and for years I wondered who my benefactor was. I was at a conference at Purdue about 15 years later, and Bill came up to me and said, and this is a quote, "I wrote a better paper than you did." He was right: the published paper is five pages long!

47.4 Find brilliant colleagues

I was and am extremely lucky in my choice of colleagues, and there is both plan and serendipity in this. A partial list of collaborators includes Presidents' Award winners Ross Prentice, Jeff Wu, Peter Hall, Kathryn Roeder, Jianqing Fan, Xihong Lin, and Nilanjan Chatterjee. Anyone who writes papers with them is, by definition, brilliant!


However, from 1974 to 1984, the number of statistical coauthors of what I consider good papers totaled exactly one, namely David Ruppert (now at Cornell). Even by 1988, after sabbaticals and the Presidents' Award, for what I considered really good papers, the total number of statistical coauthors who were not my students was only eight. To appreciate how the world has changed, or how I have changed, from 2012 until now my serious statistics papers not with students or postdocs have had 26 statistical coauthors! It is an amazing thing to reflect upon the massive change in the way statistical methodology has evolved.

I received my PhD in 1974, and by 1978 I was tenured. I had written a number of sole-authored papers in major journals, but I was not very satisfied with them, because I seemed, to my mind, to be lurching from one technical paper to the next without much of a plan. It just was not much fun, and I started toying with the idea of going to medical school, which seemed a lot more interesting, as well as more remunerative.

My statistical world changed in the fall of 1977, when David Ruppert became my colleague. It was a funny time that fall because both of our wives were away working on postdocs/dissertations, and our offices were next to one another in a corner. We became friends, but in those days people did not naturally work together. That fall I had a visitor and was trying to understand a topic that at the time was fashionable in robust statistics: what happens to the least squares estimator obtained after deleting a percentage of the data with the largest absolute residuals from an initial least squares fit? The perceived wisdom was that this was a great new statistical technique. David heard us talking about the problem, came in to participate, and then my visitor had to leave. So, David and I sat there staring at the blackboard, and within two hours we had solved the problem, other than the technical details (there went two months). It was fun, and I realized that I did not much like working alone, but wanted to share the thrill of discovery and pick other people's brains. The best part was that David and I had the same mentality about methodology, but his technical skill set was almost orthogonal to mine.

The paper (Carroll and Ruppert, 1980) also included some theory about quantile regression. The net effect was that we showed that trimming some large residuals after an initial fit is a terrible idea, and the method quickly died the death that it deserved. A fun paper along these same lines is He and Portnoy (1992). The paper also had my first actual data set, the salinity data from the Pamlico Sound. We published the data in a table, and it has made its way into numerous textbooks, but without a citation! In the nine years we were colleagues, we wrote 29 papers, including two papers on transformation of data (Carroll and Ruppert, 1981, 1984). Overall, David and I are at 45 joint papers and three books. It is trite to give advice like "Get lucky and find a brilliant colleague at the rarified level of David Ruppert," but that's what I did. Lucky I am!


47.5 Serendipity with data

Just before David and I became colleagues, I had my first encounter with data. This will seem funny to new researchers, but this was in the era of no personal computers and IBM punch cards.

It was late 1976, I was the only Assistant Professor in the department, and I was sitting in my office happily minding my own business, when two very senior and forbidding faculty members came to my office and said "Carroll, come here, we want you to meet someone" (yes, in the 1970s, people really talked like that, especially at my institution). In the conference room was a very distinguished marine biologist, Dirk Frankenberg (now deceased), who had come over for a consult with senior people, and my colleagues said, in effect, "Talk to this guy" and left. He was too polite to say "Why do I want to talk to a 26-year-old who knows nothing?" but I could tell that was what he was thinking.

Basically, Dirk had been asked by the North Carolina Department of Fisheries (NCDF) to build a model to predict the shrimp harvest in the Pamlico Sound for 1977 or 1978, I forget which. The data, much of it on envelopes from fishermen, consisted of approximately n = 12 years of monthly harvests, with roughly four time periods per year, and p = 3 covariates: water temperature in the crucial estuary, water salinity in that estuary, and the river discharge into the estuary, plus their lagged versions. I unfortunately (fortunately?) had never taken a linear models course, and so was too naive to say the obvious: "You cannot do that, n is too small!" So I did.

In current lingo, it is a "small p, small n" problem, the very antithesis of what is meant to be modern. I suspect 25% of the statistical community today would scoff at thinking about this problem because it was not "small n, large p," but it actually was a problem that needed solving, as opposed to lots of what is going on. I noticed a massive discharge that would now be called a high-leverage point, and I simply censored it at a reasonable value. I built a model, and it predicted that 1978 (if memory serves) would be the worst year on record, ever (Hunt et al., 1980), and they should head to the hills. Dirk said "Are you sure?" and me in my naïveté said "yes," and like a gambler, it hit: it was the terrible year. The NCDF then called it the NCDF model! At least in our report we said that the model should be updated yearly (my attempt at full employment and continuation of the research grant), but they then fired us. The model did great for two more years (blind luck), then completely missed the fourth year, whereupon they changed the title of the model to reflect where I was employed at the time. You can find Hunt et al. (1980) at http://www.stat.tamu.edu/~carroll/2012.papers.directory/Shrimp_Report_1980.pdf.

This is a dull story, except for me, but it also had a moral: the data were clearly heteroscedastic. This led me to my fascination with heteroscedasticity, which later led to my saying that "variances are not nuisance parameters" (Carroll, 2003).


In the transformation world, it led David and me to a paper (Carroll and Ruppert, 1984), and also led to half of our first book, Transformation and Weighting in Regression. Dirk later set us up on another project, managing the Atlantic menhaden fishery, with a brilliant young colleague of his named Rick Deriso, now Chief Scientist, Tuna-Billfish Program, at the Inter-American Tropical Tuna Commission (IATTC) (Reish et al., 1985; Ruppert et al., 1984, 1985).

47.6 Get fascinated: Heteroscedasticity

From the experience with what I call the salinity data, I became fascinated with the concept of heteroscedasticity. I went on a sabbatical to Heidelberg in 1980, and, bereft of being able to work directly with David, started thinking hard about modeling variances. I asked what I thought was an obvious question: can one do weighted least squares efficiently without positing a parametric model for the variances? David and I had already figured out that the common practice of positing a model for the variances as a function of the mean and then doing normal-theory maximum likelihood was not model-robust if the variance function was misspecified (Carroll and Ruppert, 1982).

I sat in my nice office in Heidelberg day after day, cogitating on the problem. For a month I did nothing else (the bliss of no email), and wrote a paper (Carroll, 1982) that has many references. In modern terms it is not much of a technical "tour de force," and modern semiparametric statisticians have recognized that this is a case where adaptation is obvious, but at the time it was very new and surprising, and indeed a very, very senior referee did not think it was true, but he could not find the flaw. The Editor of The Annals of Statistics, to whom I am forever grateful for sticking up for me, insisted that I write out a very detailed proof, which turned out to be over 100 pages by hand, and be prepared to make it available: make a copy and send it along. I still have it! The paper is mostly cited by econometricians, but it was fun.

Later, with my then student Marie Davidian, currently the ASA President, we worked out the theory and practice of parametric variance function estimation (Davidian and Carroll, 1987).

47.7 Find smart subject-matter collaborators

In many fields of experimental science, it is thought that the only way to advance is via solving a so-called "major" problem. In statistics, though, "major" problems are not defined a priori. If they exist, then many very smart people are working on them, which seems a counter-productive strategy to me.


I once spent a fruitless year trying to define something "major," and all I ended up with was feeling stupid and playing golf and going fishing. I now just float: folks come to me to talk about their problems, and I try to solve theirs and see if there is a statistics paper in it.

What I do like, though, are the personal paradigm shifts that occur when researchers wander into my office with a "simple" problem. This happened to me on a sabbatical in 1981–82 at the National Heart, Lung, and Blood Institute in Bethesda, Maryland. I was a visitor, and all the regular statisticians had gone off for a retreat, and one day in walks Rob Abbott, one of the world's great cardiovascular epidemiologists (with a PhD in statistics, so he speaks our language). He asked "Are you a statistician?" I admitted it (I never do at a party), and he wanted to talk with someone about a review he had gotten on a paper about coronary heart disease (CHD) and systolic blood pressure (SBP). If you go to a doctor's office and keep track of your measured SBP, you will be appalled by its variability. My SBP has ranged from 150 to 90 in the past three years, as an example. A referee had asked "What is the effect of measurement error in SBP on your estimate of the relative risk of CHD?" In the language of current National Football League beer commercials, I said "I love you guy." I will quote Larry Shepp, who "discovered" a formula that had been discovered many times before, and who said "Yes, but when I discovered it, it stayed discovered!" You can find this on the greatest source of statistics information, Wikipedia.

I was convinced at the time (I have since found out this is not exactly true) that there was no literature on nonlinear models with measurement error. So, I dived in and have worked on this now for many years. The resulting paper (Carroll et al., 1984), a very simple paper, has a fair number of citations, and many papers after this one have more. How many times in one's life does a stranger wander in and say "I have a problem," and you jump at it?

Actually, to me, this happens a lot, although not nearly with the same consequences. In the late 1990s, I was at a reception for a toxicological research center at Texas A&M, and feeling mighty out of place, since all the lab scientists knew one another and were doing what they do. I saw a now long-term colleague in Nutrition, Nancy Turner, seeming similarly out of place. I wandered over, asked her what she did, and she introduced me to the world of molecular biology in nutrition. She drew a simple little graph of what statisticians now call "hierarchical functional data," and we have now written many papers together (six in statistics journals), including a series of papers on functional data analysis (Morris et al., 2001; Morris and Carroll, 2006).


47.8 After the Presidents' Award

Since the COPSS Award, my main interests have migrated to problems in epidemiology and statistical methods to solve those problems. The methods include deconvolution, semiparametric regression, measurement error, and functional data analysis, which have touched on problems in nutritional epidemiology, genetic epidemiology, and radiation epidemiology. I have even become a committed Bayesian in a fair amount of my applied work (Carroll, 2013).

I have found problems in nutritional epidemiology particularly fascinating, because we "know" from animal studies that nutrition is important in cancer, but finding these links in human longitudinal studies has proven to be surprisingly difficult. I remember an exquisite experiment done by Joanne Lupton (now a member of the US Institute of Medicine) and Nancy Turner where they fed animals a diet rich in corn oil (the American potato chip diet) versus a diet rich in fish oil, exposed them to a carcinogen, and within 12 hours after exposure all the biomarkers (damage, repair, apoptosis, etc.) lit up as different between the two diets, with corn oil always on the losing end. When the microarray became the gold standard, in retrospect a sad and very funny statement, they found that without doing anything to the animals, 10% of the genes were different at a false discovery rate of 5%. Diet matters!

There are non-statisticians such as Ed Dougherty who think the field of statistics lost its way when the microarray came in and thinking about hypotheses/epistemology went out (Dougherty, 2008; Dougherty and Bittner, 2011): "Does anyone really believe that data mining could produce the general theory of relativity?" I recently had a discussion with a very distinguished computer scientist who said, in effect, that it is great that there are many computer scientists who understand (Bayesian) statistics, but would it not be great if they understood what they are doing scientifically? It will be very interesting to see how this plays out. Statistical reasoning, as opposed to computation, while not the total domain of statisticians, seems to me to remain crucial. To quote from Dougherty and Bittner (2011):

"The lure of contemporary high-throughput technologies is that they can measure tens, or even hundreds, of thousands of variables simultaneously, thereby spurring the hope that complex patterns of interaction can be sifted from the data; however, two limiting problems immediately arise. First, the vast number of variables implies the existence of an exponentially greater number of possible patterns in the data, the majority of which likely have nothing to do with the problem at hand and a host of which arise spuriously on account of variation in the measurements, where even slight variation can be disastrous owing to the number of variables being considered. A second problem is that the mind cannot conceptualize the vast number of variables. Sound experimental design constrains the number of variables to facilitate


finding meaningful relations among them. Recall Einstein's comment that, for science 'truly creative principle resides in mathematics.' The creativity of which Einstein speaks resides in the human mind. There appears to be an underlying assumption to data mining that the mind is inadequate when it comes to perceiving salient relations among phenomena and that machine-based pattern searching will do a better job. This is not a debate between which can grope faster, the mind or the machine, for surely the latter can grope much faster. The debate is between the efficacy of mind in its creative synthesizing capacity and pattern searching, whether by the mind or the machine."

What success I have had comes from continuing to try to find research problems by working on applications and finding important/interesting applied problems that cannot be solved with existing methodology. I am spending much of my time these days working on developing methods for dietary patterns research, since nutritional epidemiologists have found that dietary patterns are important predictors of cancer. I use every tool I have, and engage many statistical colleagues to help solve the problems.

References

Carroll, R.J. (1978). On almost sure expansion for M-estimates. The Annals of Statistics, 6:314–318.

Carroll, R.J. (1982). Adapting for heteroscedasticity in linear models. The Annals of Statistics, 10:1224–1233.

Carroll, R.J. (2003). Variances are not always nuisance parameters: The 2002 R.A. Fisher lecture. Biometrics, 59:211–220.

Carroll, R.J. (2014). Estimating the distribution of dietary consumption patterns. Statistical Science, 29: in press.

Carroll, R.J. and Ruppert, D. (1980). Trimmed least squares estimation in the linear model. Journal of the American Statistical Association, 75:828–838.

Carroll, R.J. and Ruppert, D. (1981). Prediction and the power transformation family. Biometrika, 68:609–616.

Carroll, R.J. and Ruppert, D. (1982). A comparison between maximum likelihood and generalized least squares in a heteroscedastic linear model. Journal of the American Statistical Association, 77:878–882.

Carroll, R.J. and Ruppert, D. (1984). Power transformations when fitting theoretical models to data. Journal of the American Statistical Association, 79:321–328.


Carroll, R.J., Spiegelman, C.H., Lan, K.K.G., Bailey, K.T., and Abbott, R.D. (1984). On errors-in-variables for binary regression models. Biometrika, 71:19–26.

Davidian, M. and Carroll, R.J. (1987). Variance function estimation. Journal of the American Statistical Association, 82:1079–1092.

Dougherty, E.R. (2008). On the epistemological crisis in genomics. Current Genomics, 9:67–79.

Dougherty, E.R. and Bittner, M.L. (2011). Epistemology of the Cell: A Systems Perspective on Biological Knowledge. Wiley–IEEE Press.

He, X. and Portnoy, S. (1992). Reweighted LS estimators converge at the same rate as the initial estimator. The Annals of Statistics, 20:2161–2167.

Hunt, J.H., Carroll, R.J., Chinchilli, V., and Frankenberg, D. (1980). Relationship between environmental factors and brown shrimp production in Pamlico Sound, North Carolina. Report to the Division of Marine Fisheries, North Carolina Department of Natural Resources.

Morris, J.S. and Carroll, R.J. (2006). Wavelet-based functional mixed models. Journal of the Royal Statistical Society, Series B, 68:179–199.

Morris, J.S., Wang, N., Lupton, J.R., Chapkin, R.S., Turner, N.D., Hong, M.Y., and Carroll, R.J. (2001). Parametric and nonparametric methods for understanding the relationship between carcinogen-induced DNA adduct levels in distal and proximal regions of the colon. Journal of the American Statistical Association, 96:816–826.

Reish, R.L., Deriso, R.B., Ruppert, D., and Carroll, R.J. (1985). An investigation of the population dynamics of Atlantic menhaden (Brevoortia tyrannus). Canadian Journal of Fisheries and Aquatic Sciences, 42:147–157.

Ruppert, D., Reish, R.L., Deriso, R.B., and Carroll, R.J. (1984). Monte Carlo optimization by stochastic approximation, with application to harvesting of Atlantic menhaden. Biometrics, 40:535–545.

Ruppert, D., Reish, R.L., Deriso, R.B., and Carroll, R.J. (1985). A stochastic model for managing the Atlantic menhaden fishery and assessing managerial risks. Canadian Journal of Fisheries and Aquatic Sciences, 42:1371–1379.


48 Publishing without perishing and other career advice

Marie Davidian
Department of Statistics
North Carolina State University, Raleigh, NC

In my 25-plus years as an academic statistician, I have had the good fortune of serving in a variety of roles, including as statistical consultant and collaborator, Editor, Chair of grant review panels, and organizer of and participant in workshops for junior researchers. Drawing on this experience, I share my thoughts and advice on two key career development issues: balancing research with the many demands on one's time, and cultivating and developing one's communication skills.

48.1 Introduction

A career in statistical research is both exciting and challenging. Contributing to the advance of knowledge in our field is extremely rewarding. However, many junior researchers report having difficulty balancing this objective with their many other responsibilities, including collaborative work on funded projects in other disciplines, teaching, and service. And rightly so — our field is unique in the sense that many of us are expected to engage in methodological research in our own discipline and to contribute to research in other disciplines through participation in substantive projects. The latter, along with instructional and service responsibilities, can be difficult to navigate for young researchers in the first few years of their careers.

Also unique to our field is the need for an outstanding ability to communicate effectively, not only with each other but across diverse disciplinary boundaries. To meet this dual challenge, we must be excellent writers and speakers. Publishing our own work, writing successful grant applications in support of our research, and assisting our collaborators with communicating the results of their research and applying for funding together constitute a critical skill set that we statisticians must develop.


This volume commemorating the 50th anniversary of the Committee of Presidents of Statistical Societies (COPSS) presents an excellent opportunity for me to share my experience, for the most part learned the hard way, on balancing the competing demands we face and on being an effective communicator. As you'll see in the next two sections, it took me some time in my own career to develop these skills. Despite my slow start, I have subsequently been very fortunate to have served as a journal editor, a chair of NIH grant review panels, and as a consulting and collaborating statistician, through which I have learned a great deal about both of these topics.

With many colleagues over the past decade, including authors of some other chapters of this book, I have served as a senior participant in what is now called the ENAR Workshop for Junior Biostatisticians in Health Research, which has been supported by grants from the National Institutes of Health. (Xihong Lin and I wrote the first grant application, and the grant has subsequently been renewed under the expert direction of Amy Herring.) Although targeted to biostatisticians, this workshop covers skills that are essential to all young researchers. Much of what I have to say here has been shaped not only by my own career but also by the insights of my fellow senior participants.

48.2 Achieving balance, and how you never know

Embarking on a career as a statistical researcher can be daunting, probably considerably more so today than it was for me back in 1987. I had just received my PhD in statistics from the University of North Carolina at Chapel Hill and had accepted a position in the Department of Statistics at North Carolina State University, barely 25 miles away. I was excited to have the opportunity to become a faculty member and to teach, consult with other scientists on campus, and carry out statistical methods research.

And, at the same time, I was, frankly, terrified. Sure, I'd done well in graduate school and had managed to garner job offers in several top departments. But could I really do this? In particular, could I really do research?

I was extremely fortunate to have had a thesis advisor, Ray Carroll, who was what we would call today an outstanding mentor. Ray had not only introduced me to what at the time was a cutting-edge methodological area through my dissertation research, he had also been a great role model. I'll tell you more about Ray in the next section. He seemed confident in my prospects for success in academia and urged me to forge ahead, assuring me that I would do just fine.

But I couldn't help being plagued by self-doubt. While I was in graduate school, Ray was always there. He proposed the area in which I did my dissertation research.


He was available to help when I got stuck and to discuss the next step. And now I was supposed to do this all on my own? Moreover, I didn't know the first thing about collaborating with scientists in other disciplines.

The first year wasn't easy as I made the transition from student in a very theoretical department to faculty member in a much more applied one, in a position in which I was expected to serve as a consultant to faculty in other departments across campus. It was a bit like a "trial by fire" as I struggled to learn what is truly the art of being a good applied statistician and collaborator, a skill light-years removed from my training in Chapel Hill. Simultaneously, as the papers from my dissertation were accepted, the realization that I needed to move forward with research loomed. Sure, I had some extensions of my dissertation work I was pursuing, but after that, what would I do? I couldn't keep doing variance function estimation forever. The amount of time I spent on my mostly routine but extensive statistical consulting and on teaching the two-semester sequence for PhD students in agriculture and the life sciences left me little time to ponder new research problems. To top it off, I was asked to serve on the university's Undergraduate Courses and Curriculum Committee, and, not knowing any better, I agreed. I now know that a faculty member as junior as I was should not be asked to serve on a committee that meets every few weeks for several hours and focuses solely on administrative activities completely tangential to research or collaboration.

I will admit to spending many evenings sitting on the balcony of my Raleigh apartment, looking out at the parking lot and wondering how I would ever compile a record worthy of promotion and tenure a scant six years later.

But the most amazing thing happened. A student in the Department of Crop Science who was taking my statistics course approached me after class and asked if she could make an appointment to discuss her dissertation research, which involved the development of a new, experimental strain of soybean. She had conducted a field experiment over the last three growing seasons in which she had collected longitudinal data on measures of plant growth of both the experimental and a commercial strain, and she was unsure of how to conduct an analysis that would address the question of whether the two competing soybean varieties had different specific features of their growth patterns. The growth trajectories showed an "S-shaped" pattern that clearly couldn't be described by the regression models she knew, and it did not seem to her that analysis of variance methods would address the questions. Could I help? (Of course, at that point I could have no input on the design, which is sadly still often the case to this day, but luckily the experiment had been well designed and conducted.)

At about this same time, I had regularly been bemoaning my feelings of inadequacy and of being overwhelmed to my good friend David Giltinan, who had graduated from Chapel Hill three years ahead of me and taken a job in nonclinical research at the pharmaceutical company Merck. David recognized that some of my dissertation research was relevant to problems he was seeing and introduced me to the subject-matter areas and his collaborators. He had begun working with pharmacokineticists and insisted that I needed to learn about this field and that I could have a lot to contribute. I grudgingly agreed to look at some of the papers he sent to me.


So what does this have to do with soybeans? Everything. As it turned out, despite the disparate application areas, the statistical problem in both the soybean experiment and pharmacokinetics was basically the same. Longitudinal trajectories that exhibited patterns that could be well described by models nonlinear in the parameters, arising from solutions to differential equations, but where the parameters obviously took on different values across plants or subjects. Questions about the typical behavior of specific features of the trajectories, how variable this is across plants or subjects, and how it changes systematically with the characteristics of the plants or subjects (like strain, weight, or kidney function). And so on.

These two chance events, a consulting client needing help analyzing data from a soybean experiment and a friend insisting I learn about an application area I previously did not even know existed, led me to a fascinating and rewarding area of methodological research. The entire area of nonlinear mixed effects modeling and analysis, the groundwork for which had been laid mostly by pharmacokineticists in their literature, was just being noticed by a few statisticians. Fortuitously, David and I were among that small group. The need for refinement, new methodology, and translation to other subject-matter areas (like crop science) was great. I went from fretting over what to work on next to frustration over not having enough time to pursue simultaneously all the interesting challenges to which I thought I could make a contribution.

My determination led me to figure out how to make the time. I'd found a niche where I knew I could do useful research, which would never have happened had I not been engaged in subject-matter challenges through my consulting and friendship with David. I was no longer sitting on the balcony; instead, I spent some of those evenings working. I realized that I did not have to accommodate every consulting client's preferred meeting time, and I adopted a firm policy of blocking off one day per week during which I would not book consulting appointments or anything else, no matter what. And when my term on the university committee concluded, I declined when approached about a similar assignment.

To make a long story short, I am proud that David and I were among the many statisticians who developed methods that brought nonlinear mixed effects models into what is now routine use. Our most exciting achievement was when John Kimmel, who was then a Statistics Editor with Chapman & Hall, approached us about writing a book on the topic. Write a book? That had not dawned on either of us (back then, writing a book on one's research was much less common than it is today). For me, was this a good idea, given that I would be coming up for tenure in a year? Writing a book is a significant, time-consuming undertaking; would this be a sensible thing to do right now? As scared as I was about tenure, the opportunity to work with David on putting together a comprehensive account of this area, all in one place, and to make it accessible to practitioners and researchers who could benefit, was just too compelling.


These two chance events — a consulting client needing help analyzing data from a soybean experiment and a friend insisting I learn about an application area I previously did not even know existed — led me to a fascinating and rewarding area of methodological research. The entire area of nonlinear mixed effects modeling and analysis, the groundwork for which had been laid mostly by pharmacokineticists in their own literature, was just being noticed by a few statisticians. Fortuitously, David and I were among that small group. The need for refinement, new methodology, and translation to other subject-matter areas (like crop science) was great. I went from fretting over what to work on next to frustration over not having enough time to pursue simultaneously all the interesting challenges to which I thought I could make a contribution.

My determination led me to figure out how to make the time. I’d found a niche where I knew I could do useful research, which would never have happened had I not been engaged in subject-matter challenges through my consulting and my friendship with David. I was no longer sitting on the balcony; instead, I spent some of those evenings working. I realized that I did not have to accommodate every consulting client’s preferred meeting time, and I adopted a firm policy of blocking off one day per week during which I would not book consulting appointments or anything else, no matter what. And when my term on the university committee concluded, I declined when approached about a similar assignment.

To make a long story short, I am proud that David and I were among the many statisticians who developed methods that brought nonlinear mixed effects models into what is now routine use. Our most exciting achievement came when John Kimmel, who was then a Statistics Editor with Chapman & Hall, approached us about writing a book on the topic. Write a book? That had not dawned on either of us (back then, writing a book on one’s research was much less common than it is today). For me, was this a good idea, given that I would be coming up for tenure in a year? Writing a book is a significant, time-consuming undertaking; would this be a sensible thing to do right now? As scared as I was about tenure, the opportunity to work with David on putting together a comprehensive account of this area, all in one place, and to make it accessible to practitioners and researchers who could benefit, was just too compelling.

As it turned out, I ended up leaving North Carolina for Harvard (for personal reasons) shortly after we agreed to do the book, but I made it a priority and continued my policy of blocking off a day to work on it, despite being a clinical trials statistician with 13 protocols — I did waffle a few times, but for the most part I simply made it clear that I was not available for conference calls or meetings on that day. Remarkably, we stuck to our vow of completing the book in a year (and I did get tenure and eventually moved back to North Carolina State). Our book (Davidian and Giltinan, 1995), although it is now somewhat outdated by all the advances in this area that have followed it, remains one of my most satisfying professional accomplishments to this day.

As I said at the outset, starting out in a career in statistical research can be overwhelming. Balancing so many competing demands — collaborative projects, consulting, teaching, methodological research, committee responsibilities — is formidable, particularly for new researchers transitioning from graduate school. It may seem that you will never have enough time for methodological research, and you may even find identifying worthy research problems to be challenging, as I did. My story is not unique, and it taught me many lessons, on which I, along with my fellow senior participants in the ENAR junior researchers workshop, have dispensed advice over the years. Here are just a few of the key points we always make.

Number 1: Set aside time for your own interests, no matter what. It can be an entire day or an afternoon, whatever your position will permit. Put it in your calendar, and block it off. And do not waver. If a collaborator wants to schedule a meeting during that time, politely say that you are already committed. Your research is as important as your other responsibilities, and thus merits dedicated time, just as meetings, teaching, and committee activities do.

Along those same lines, learn that it is okay to say “no” when the alternative is being over-committed. Do not agree to take on new projects or responsibilities unless you are given time and support commensurate with the level of activity. If you are in a setting in which statisticians are asked to be part of a project for a percentage of their time, insist on that percentage being adequate — no project will ever involve just five percent of your effort. If you are being asked to serve on too many departmental or university committees, have an honest talk with your Department Chair to establish a realistic expectation, and then do not exceed it.

Finally, never pre-judge. When I set up the appointment with the crop scientist, I assumed it would be just another routine consulting encounter, for which I’d propose and carry out standard analyses and which would just add to the pile of work I already had. When David insisted I learn about pharmacokinetics, I was skeptical. As statisticians engaged in collaboration, we will always do many routine things, and we eventually develop radar for identifying the projects that are likely to be routine. But you never know when that next project is going to reveal a new opportunity. And, as it did for


me, alter the course of one’s career. Be judicious to the extent that you can, but, unless you have very good reason, never write off anything. And never say never. If you’d told me back in 1987 that I would have published a best-selling book a mere eight years later, I would have asked you what you had been smoking!

48.3 Write it, and write it again

I’d always liked writing — in fact, when I was in high school, I toyed with the idea of being an English major. But my love of math trumped that idea, and I went on to major first in mechanical engineering at the University of Virginia and then, realizing I was pretty bored, switched to applied mathematics. It was in the last semester of my senior year that, by chance, I took a statistics course from a relatively new Assistant Professor named David Harrington. I was hooked, and Dave was such a spectacular instructor that I ended up hanging around for an additional year and getting a Master’s degree. Because I was in an Applied Mathematics Department in an Engineering School back in 1980, and because there was no Statistics Department at UVa back then, I ended up taking several courses from Dave (one of the only statisticians on the entire campus), including a few as reading courses. You may know of Dave — he eventually left Virginia for the Dana-Farber Cancer Institute and the Department of Biostatistics at the Harvard School of Public Health — and among his many other accomplishments, he wrote a best-selling book on survival analysis (Fleming and Harrington, 1991) with his good friend from graduate school, Tom Fleming.

I mention Dave because he was the first person ever to talk to me explicitly about the importance of a statistician being a good writer. I had to write a Master’s thesis as part of my degree program, and of course Dave was my advisor. It was mainly a large simulation study, which I programmed and carried out (and which was fun) — the challenge was to write up the background and rationale, the design of the simulations, and the results and their interpretation in a clear and logical fashion. I will always remember Dave’s advice as I set out to do this for the first time: “Write it, and write it again.” Meaning that one can always improve on what one has written to make it more accessible and understandable to the reader. And that one should always strive to do this. It’s advice I give to junior researchers and my graduate students to this day.

I learned a lot from Dave about clear and accessible writing through that Master’s thesis. And fortunately for me, my PhD advisor, Ray Carroll, picked up where Dave left off. He insisted that I develop the skill of writing up results as I obtained them in a formal and organized way, so that by the time I had to begin preparing my dissertation, I had a large stack of self-contained documents, neatly summarizing each challenge, derivation, and result. Ray always


emphasized the importance of clarity and simplicity in writing and speaking (which he demonstrated by editing everything I wrote for my dissertation and the papers arising from it). His motto for writing a good journal article was “Tell ’em what you’ll tell ’em, tell ’em, and tell ’em what you told ’em.” As you’ll see shortly, I’ve adopted that one as a guiding principle as well.

I learned a lot from both Dave and Ray that laid the groundwork for my own strong interest in effective scientific writing. I am certain that, had I not had the benefit of their guidance, I would not have developed my own skills to the point that I eventually had the opportunity to serve as a Journal Editor. In my three years as Coordinating Editor of Biometrics in 2000–02 and my current role (since 2006) as Executive Editor, I have read and reviewed probably well over 1000 papers and have seen the entire spectrum, from those that were a joy to read to those that left me infuriated. And ditto for my time spent on NIH study sections (grant review panels). During my many years on what is currently the NIH Biostatistical Methods and Research Design study section, including three years as its Chair, I read grant applications that were so clear and compelling that I almost wanted to write my own personal check to fund them, but others that left me questioning the audacity of the investigators for expecting the taxpayers to support a project that they could not even convincingly and clearly describe.

What is it that makes one article or grant application so effective and another so dreadful? Of course, the methodological developments being presented must have a sound basis. But even if they are downright brilliant and path-breaking, if they are not communicated in a way that the intended audience can unambiguously understand, they are not going to be appreciated. Given that what one has to say is worthy, then, it is the quality of the writing that plays the primary role in whether or not a paper gets published or a grant gets funded. I’ll concentrate on writing here, but most of what I say can be adapted equally well to oral presentation.

So how does one become a good writer? Admittedly, some people are just naturally gifted communicators, but most of us must practice and perfect our writing skills. And they can be perfected! Here is a synopsis of the points my colleagues and I stress to junior researchers when discussing effective writing of journal articles and grant applications.

First and foremost, before you even begin, identify and understand your target audience. If you are writing a journal article, you have two types of target readers: the Editor, Associate Editor, and referees at the journal, some of whom will be experts in the area and all of whom must be convinced of your work’s relevance and novelty; and, ultimately, readers of the journal, who may span the range from experts like you to others with a general background who are hoping to learn something new. If you are writing a grant application, it is likely that many on the review panel will have only passing familiarity with your area while a few will be experts. Your presentation must be accessible to all of them, providing the novices with the background they need to understand your work while communicating the key advances to experts who


already have that background. And you need to do this while respecting a nonnegotiable restriction on the length of your article or research proposal. That’s a pretty tall order.

To address it, take some time and think carefully about the main message you want to convey and what you can reasonably hope to communicate effectively in the space allotted. You must acknowledge that you cannot pack in everything that you’d like or give the full background. So step into the shoes of your different readers. What background is essential for a novice to appreciate the premise of your work? Some of this you may be able to review briefly and explicitly, but most likely you will need to refer these readers to references where that background is presented. In that case, what references would be most appropriate? What results would an expert be willing to accept without seeing all the technical details (that might be more than a novice would need to see anyway)? What aspects of your work would be the most exciting to expert readers and should be highlighted, and which could be just mentioned in passing? Careful assessment of this will help you to establish what you must include to reach everyone and what you can omit or downplay but still communicate the main message. For journal articles, the option of supplementary material allows you the luxury of presenting much more, but always keep in mind that not all readers will consult it, so the main article must always contain the most critical material.

Once you have an idea of your audience and what they should take away, the key is to tell the story. For a journal article, a good principle to follow is the one Ray espouses; for a grant application, the format is more regimented, but the same ideas apply. First, “tell ’em what you’ll tell ’em!” The introductory section to an article or proposal is often the hardest to write but the most important. This is where you motivate and excite all readers and give them a reason to want to keep reading! The opening sentence should focus immediately on the context of the work; for example, the renowned paper on generalized estimating equations by Liang and Zeger (1986) starts with “Longitudinal data sets are comprised...,” which leaves no doubt in a reader’s mind about the scope of the work. After setting the stage like this, build up the background. Why is the problem important? What are the major challenges? What is known? What are the limitations of current methods? What gaps in understanding need to be filled? For novice readers, note critical concepts and results that must be understood to appreciate your work, and provide key references where these readers may obtain this understanding. It is often very helpful to cite a substantive application that exemplifies the challenge (some journals even require this); this may well be an example that you will return to later to illustrate your approach.

The next step is to “tell ’em.” You’ve made the case for why your audience should be interested in your story; now, tell it! Here, organization and logical flow are critical. Organize your presentation into sections, each having a clear focus and purpose that naturally leads to what follows. Completeness is critical; at any point along the way, the reader should have all the information


s/he needs to have followed the argument up to that point. Motivate and describe the steps leading to your main results, and relegate any derivations or side issues that could distract from the main flow of ideas to supplementary material for a journal article (or don’t include them at all in a grant application). Relate complex concepts to concrete examples or simple special cases to help novice readers grasp the main ideas. This is especially effective in grant applications, where reviewers are likely not to be experts.

The following principles seem obvious, but you would be surprised how often authors violate them! Do not refer to ideas or concepts until after you have introduced them. State your assumptions up front, before or when you need them for the first time. Do not use acronyms, terms, symbols, or notation until after they have been defined; for that matter, be sure to define every acronym, term, and symbol you use. And only define notation you really need. The less clutter and information a reader has to remember, the better.

Be as clear, concise, and helpful as you can. With limited space, every sentence and equation counts and must be understandable and unambiguous. Avoid “flowery” words if simpler ones are available, and if you catch yourself writing long sentences, strive to break them into several. Paraphrase and interpret mathematical results in plain English to give a sense of what the results mean and imply. Use a formal, scientific style of writing (different from that used in this chapter). In particular, do not use contractions such as “it’s” and “don’t,” and use only complete sentences; although these constructions may be used in a “popular” piece of writing like this one, they are not appropriate in scientific writing. Grammar and punctuation should be formal and correct (ask a colleague for help if English is not your native language), and be sure to spell check. Consult the articles in your target journal for examples of stylistic and grammatical conventions.

When reporting empirical studies, be sure to present everything a reader would need to reproduce a simulation scenario him- or herself. Do not display mind-numbing tables of numbers with little explanation; instead, choose to present limited results that illustrate the most important points and provide detailed interpretation, emphasizing how the results support the premise of your story. In fact, consider whether it is feasible to present some results graphically, which can often be more efficient and effective than a tabular format.

In summary, do not leave your reader guessing! One useful practice to adopt is to step into your audience’s shoes often. Read what you have written, and ask yourself: “Would I be able to understand what comes next given what I have presented so far?” Be honest, and you’ll identify ways you could do a better job of conveying your message.

You may not have this luxury in a grant application, but in a journal article, you do. Once you’ve told the story, “tell ’em what you told ’em.” Usually, this would be done in a final Discussion or Conclusions section. Restate what you set out to accomplish and review what was done to address it. Highlight the key findings, and discuss their significance and impact. It is just as


important to note the limitations of what you have done and to identify what remains to be done. This summary does not have to be very long, but it should leave the reader with a clear understanding of why s/he bothered to read your work.

How does one go about getting started? The prospect of sitting down to write a journal article or research proposal can be daunting. Keep in mind that authors who simply sit down and start writing are rare. Most good writers, either literally, or figuratively in their minds, formulate an outline establishing the basic organization. Some will do this by beginning a LaTeX document with tentative section headings that correspond to the main ideas and results, and then filling in details. Also know that most good writers do not do this in order. They may write up the results first and leave the introduction until later, after they have a sense of what follows that material. If you find yourself grasping for the right word and/or agonizing over a detail, do not allow yourself to get stuck; make a note to come back to it later. As your work takes shape, you’ll realize that you may want to move some material to another place, and the words and details you struggled with begin to gel.
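As one concrete (and purely illustrative) version of that approach, such a starting document might contain nothing more than tentative headings to be filled in and rearranged later:

\documentclass{article}
\begin{document}
\section{Introduction}           % tell 'em what you'll tell 'em
\section{Model and notation}     % tentative; rename or reorder as the story takes shape
\section{Methodology}            % the main results
\section{Simulation studies}     % empirical evidence
\section{Application}            % the motivating example revisited
\section{Discussion}             % tell 'em what you told 'em
\end{document}

The headings here are placeholders, not a prescription; the point is only to fix an organization so that the details can be filled in, in whatever order they come.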


Finally, I’ll return to Dave’s motto: “Write it, and write it again.” No writer, no matter how talented or skilled, produces a perfect first draft. Once you have your first draft, review it carefully and critically, and be ruthless! Evaluate every sentence, and make sure that it is really necessary and, if it is, that it says exactly what you mean. Be on the lookout for repetition and redundancy — have you repeated something that you said earlier that doesn’t bear repeating? This is a waste of precious space that could be put to better use. Be your own worst critic! Ask yourself: Have I cited the relevant literature and background sufficiently? Are there gaps in my logic or storytelling that would impede understanding? Are there parts that could be confusing or unclear? Is the overall message obvious, and have I made a convincing case? The bottom line is that you can always improve on what you have written. At some point, of course, you must let go and declare what you have to be the finished product, but a few rounds of putting your work aside and reading it again in a day or two can be very effective toward refining it to the point where any further improvements would be minimal.

48.4 Parting thoughts

A career in statistical research can be incredibly rewarding. The challenges are many, but the skills required can be mastered. I’ve touched on just two key elements — balancing competing responsibilities and effective writing — that have played a major role in my own career. I hope that my experience is helpful to the next generation of statistical scientists, to whom I leave one final piece of advice. Have fun! In spite of the challenges and occasional frustrations, enjoy what you do, and, if you don’t, look for a change. We in our field are lucky that, at least currently, our skills are in high demand. Regardless of what type of position you find is right for you, becoming skilled at finding balance and being a good communicator will always serve you well.

References

Davidian, M. and Giltinan, D.M. (1995). Nonlinear Models for Repeated Measurement Data. Chapman & Hall, London.

Fleming, T.R. and Harrington, D.P. (1991). Counting Processes and Survival Analysis. Wiley, New York.

Liang, K.Y. and Zeger, S.L. (1986). Longitudinal data analysis using generalized linear models. Biometrika, 73:13–22.


49
Converting rejections into positive stimuli

Donald B. Rubin
Department of Statistics
Harvard University, Cambridge, MA

“It’s not that I’m so smart, it’s just that I stay with problems longer.”
– Albert Einstein

At first glance this Einstein quotation may seem to have little to do with my title, but those readers who know something of Einstein’s early life will recall that those years were not full of recognized scientific successes, but he kept working on his problems. And that is certainly related to why I chose the quote, but there is more to it. I have been fortunate to have had many journal publications, but fewer than one percent were accepted at first submission — far more were immediately rejected, followed closely by those that were rejected accompanied by the suggestion that it would not be wise to resubmit. However, I cannot think of an instance where this nasty treatment of my magnificent (self-assessed) work (sometimes joint) did not lead to a markedly improved publication, somewhere. In fact, I think that the drafts that have been repeatedly rejected by many different journals possibly represent my best contributions! Certainly the repeated rejections, combined with my trying to address various comments, led to better exposition and sometimes better problem formulation as well.

So here, in an attempt to inspire younger researchers to stay the course, I’ll relay some of my stories on the topic, of course using some of my own publications as examples. I’ll give only a short summary of each example, hopefully just enough for the reader to get the basic idea of the work (or possibly even read it, or as my wonderful PhD advisor, Bill Cochran, used to say, “I’d prefer if you read it and understood it, but if not, please read it; failing that, just cite it!”). For potential interest, I’ll insert the approximate number of Google Scholar citations as of August 1, 2013. These counts may be of interest because the relationship between the number of citations and my memory of a paper’s ease of acceptance appears to me to be zero (excluding the EM outlier). So, young writers, if you think you have a good idea that reviewers do not appreciate, you’re not alone, and quite possibly on to a very good idea,


especially if the reviewers come across as real experts in their reports but appear to have off-target comments.

49.1 My first attempt

“A non-iterative algorithm for least squares estimation of missing values in any analysis of variance design.” Journal of the Royal Statistical Society, Series C, vol. 21 (1972), pp. 136–141. [Number of citations: 58]

This was my first sole-authored submission, and of course I thought it was very clever, combining simple matrix manipulations with simple computations to generalize an old “Rothamsted” (to use Cochran’s word) method to fill in missing data in an experimental design with their least squares estimates — a standard objective in those days (see the target article, or Little and Rubin (2002, Chapter 2), for the reason for this objective). When I submitted this, I was still a PhD student, and when I received the report and saw “tentative reject,” I was not a happy camper. Cochran calmed me down, and gave me some advice that he had learned as a wee Scottish lad on the links: Keep your eye on the ball! Meaning, the objective when writing is to communicate with your readers, and the reviewers are making useful suggestions for improved communication. He went on to say:

“The Editor is not your enemy — at this point in time, he has no idea who you even are! The Editor sent your draft to people who are more experienced than you, and they are reading it without pay to help you and the journal.”

I was calm, and the paper was accepted a revision or two later. I remained fully calm, however, only until the next “tentative reject” letter arrived a few months later.

49.2 I’m learning

“Matching to remove bias in observational studies.” Biometrics, vol. 29 (1973), pp. 159–183. Printer’s correction note in vol. 30 (1974), p. 728. [Number of citations: 392]

“The use of matched sampling and regression adjustment to remove bias in observational studies.” Biometrics, vol. 29 (1973), pp. 184–203. [Number of citations: 321]


This pair of submissions was based on my PhD thesis written under Bill’s direction — back-to-back submissions, meaning both were submitted at the same time, with the somewhat “aggressive” suggestion to publish them back-to-back if they were acceptable. Both were on matched sampling, which at the time was really an unstudied topic in formal statistics. The only publication that was close was the wonderful classic Cochran (1968). Once again, a tentative rejection, but this time with all sorts of misunderstandings, criticisms, and suggestions that would take voluminous amounts of time to implement, and because at the time I was faculty in the department, I was a busy boy! I again told Bill how furious I was about these reviews, and Bill once again calmed me down and told me to remember what he had said earlier; moreover, I should realize that these reviewers had spent even more time trying to help me, and that’s why their comments were so long. Of course, Bill was correct, and both papers were greatly improved by my addressing the comments — not necessarily accepting the suggestions, but addressing them. This lesson is important — if a reviewer complains about something and makes a suggestion as to how things should be changed, you as the author needn’t accept the reviewer’s suggestion, but you should fix that thing to avoid the criticism. I was beginning to learn how to communicate, which is the entire point of writing journal articles or books.

49.3 My first JASA submission

“Characterizing the estimation of parameters in incomplete data problems.” Journal of the American Statistical Association, vol. 69 (1974), pp. 467–474. [Number of citations: 177]

This article concerns factoring likelihoods with missing data, and presented generalizations and extensions of prior work done by Anderson (1957) and Lord (1955) concerning the estimation of parameters with special patterns of missing data. Here, the editorial situation was interesting because, when I submitted the draft in 1970, the JASA Theory and Methods Editor was Brad Efron, whom I had met a couple of years earlier when he visited Harvard, and the Associate Editor was Paul Holland, my good friend and colleague at Harvard. So, I thought, finally, I will get a fast and snappy acceptance, maybe even right away!

No way! Paul must have (I thought) selected the most confused mathematical statisticians in the world — these reviewers didn’t grasp any of the insights in my wondrous submission! And they complained about all sorts of irrelevant things. There is no doubt that if it hadn’t been for Paul and Brad, it would have taken years more to get it into JASA, or it would have followed the path of Rubin (1976) described below, or far worse. They were both helpful in explaining that the reviewers were not idiots, and actually they had some


decent suggestions, properly interpreted — moreover, they actually liked the paper — which was very difficult to discern from the reports written for my eyes. Another set of lessons was apparent. First, read between the lines of a report: Editors do not want to overcommit for fear the author won’t pay attention to the suggestions. Second, reinterpret editorial and reviewers’ suggestions in ways that you believe improve the submission. Third, thank them in your reply for suggestions that improved the paper — they did spend time writing reports, so acknowledge it. Fourth, it does help to have friends in positions of power!

49.4 Get it published!

“Estimating causal effects of treatments in randomized and nonrandomized studies.” Journal of Educational Psychology, vol. 66 (1974), pp. 688–701. [Number of citations: 3084]

This paper is the one that started my publishing trail of using the potential outcomes notation to define causal effects formally in all situations, not just in randomized experiments as in Neyman (1923). Actually, Neyman said he never made that generalization because he never thought of it, and anyway, doing so would have been too speculative; see Rubin (2010) for the story on this. Until this paper, everyone dealing with non-randomized studies of causal effects was using the observed-value notation, with one outcome variable (the observed value of the outcome) and one indicator variable for treatment. So, in fact, Rubin (1974b) was the initiating reason for the phrase “Rubin Causal Model” — RCM, coined in Holland (1986).
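(To make the contrast concrete — a minimal sketch in now-standard notation, not a quotation from the 1974 paper — for unit $i$ with treatment indicator $W_i \in \{0, 1\}$, the potential outcomes notation carries both $Y_i(1)$ and $Y_i(0)$, the outcomes that would be observed under treatment and under control, so that the unit-level causal effect $Y_i(1) - Y_i(0)$ can be written down explicitly, with the observed outcome being
$$Y_i^{\mathrm{obs}} = W_i\, Y_i(1) + (1 - W_i)\, Y_i(0).$$
The observed-value notation retains only $(Y_i^{\mathrm{obs}}, W_i)$, whereas the potential outcomes formulation makes explicit that, for every unit, one of the two potential outcomes is always missing.)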


I wrote this in some form when I was still at Harvard, more as notes for an introductory statistics course for psychologists. Someone suggested that I spruce it up a bit and submit it for publication. I did, but then couldn’t get it published anywhere! Every place that I submitted the piece rejected it. Sometimes the reason was that “every baby statistics student knows this” (I agreed that they should, but then show me where it is written!); sometimes the reason was “it’s completely wrong”! And, in fact, I just received (July 2013) an email stating that “the Rubin definition of ‘causality’ is not appealing to many eminent statisticians.” Sometimes the comments were even insulting, especially so because I was submitting statistical work from my position at the Educational Testing Service (ETS) rather than from a respected university statistics department. I asked around ETS, and someone suggested the place where it ended up, the Journal of Educational Psychology — I think that the acceptance was because of some high-level intervention from someone who did like the paper but, more importantly, wanted to get me off his back — I honestly do not remember whom to thank.

There are several lessons here. First, it demonstrates that if a publication is good and good people find out about it (again, it helps to know good people), it will get read and cited. So if you are having this kind of problem with something that you are convinced is decent, get it published somewhere, start citing it yourself in your own publications that are less contentious, and nag your friends to do so! Second, if some reviewers are repeatedly telling you that everyone knows what you are saying, but without specific references, and other reviewers are saying that what you are writing is completely wrong, but without decent reasons, you are probably on to something. This view is reinforced by the next example. And it reiterates the point that it does help to connect with influential and wise people.

49.5 Find reviewers who understand

“Inference and missing data.” Biometrika, vol. 63 (1976), pp. 581–592 (with discussion and reply). [Number of citations: 4185]

This article is extremely well known because it established the basic terminology for missing data situations, which is now so standard that this paper often isn’t cited for originating the ideas, although the definitions are often summarized somewhat incorrectly. As Molenberghs (2007) wrote: “... it is fair to say that the advent of missing data methodology as a genuine field within statistics, with its proper terminology, taxonomy, notation and body of results, was initiated by Rubin’s (1976) landmark paper.” But was this a bear to get published! It was rejected, I think twice, from both sides of JASA; also from JRSS B and, I believe, JRSS A. I then decided to make it more “mathy,” and I put in all this measure-theory “window dressing” (a.s., a.e., both with respect to different measures, because I was doing Bayesian, repeated sampling, and likelihood inference). Then it got rejected twice from The Annals of Statistics, where I thought I had a chance because I knew the Editor — knowing important people doesn’t always help. But when I told him my woes after the second and final rejection from The Annals, and I asked his advice on where I should send it next, he suggested “Yiddish Weekly” — what a great guy!

But I did not give up, even though all the comments I received were very negative; to me, these comments were also very confused and very wrong. So I tried Biometrika — home run! David Cox liked it very much, and he gave it to his PhD student, Rod Little, to read and to contribute a formal comment. All those prior rejections created not only a wonderful publication but also two wonderful friendships. The only real comment David had as the Editor was to eliminate all that measure-theory noise, not because it was wrong but rather because it just added clutter to important ideas. Two important messages: First, persevere if you think that you have something important to


say, especially if the current reviewers seem not up to speed. Second, try to find a sympathetic audience, and do not give up.

49.6 Sometimes it’s easy, even with errors

“Maximum likelihood from incomplete data via the EM algorithm.” Journal of the Royal Statistical Society, Series B, vol. 39 (1977), pp. 1–38 (joint work with A.P. Dempster and N. Laird, published with discussion and reply). [Number of citations: 34,453]

Those early years at ETS were tough with respect to getting articles accepted, and I think it is tougher submitting from less academically prestigious places. But publishing things became a bit easier as I matured. For example, the EM paper was accepted right away, even with invited discussion. It was to be a read paper in London in 1976, the trip on which I met Rod Little and David Cox in person — the latter mentioned that he really wasn’t fond of the title of the already accepted Rubin (1976), because something that’s missing can’t be “given” — the Latin meaning of data. And this rapid acceptance of the EM paper came despite one of its proofs being wrong — a misapplication of the triangle inequality! Wu (1983) corrected this error, which was not critical to the fundamental ideas in the paper about the generality of the missing data perspective. In statistics, ideas trump mathematics — see Little’s (2013) Fisher Lecture for more support for this position. In this case, a rapid acceptance allowed an error to be published and corrected by someone else. If this can be avoided, it should be, even if it means withdrawing an accepted paper; three examples of this follow.

49.7 It sometimes pays to withdraw the paper!

It sometimes pays to withdraw a paper. It can be good, it can be important, and even crucial at times, as the following examples show.

49.7.1 It’s good to withdraw to complete an idea

“Parameter expansion to accelerate EM: The PX-EM algorithm.” Biometrika, vol. 85 (1998), pp. 755–770 (joint work with C.H. Liu and Y.N. Wu). [Number of citations: 243]

This submission was done jointly with two exceptionally talented former PhD students of mine, Chuanhai Liu and Ying Nian Wu. It was a technically


very sound article, which introduced the PX-EM algorithm, an extension of EM. If correctly implemented, it always converged in fewer steps than EM — nice. But after the submission was accepted by an old friend, Mike Titterington at Biometrika, there was an intuitive connection that I knew had to be there, but that we had not included formally; this was the connection between PX-EM and ANCOVA, which generally creates more efficient estimated treatment effects by estimating a parameter whose value is known to be zero (e.g., the difference in the expected means of covariates in the treatment and control groups in a completely randomized experiment is zero, but ANCOVA estimates it by the difference in sample means). That’s what PX-EM does — it introduces a parameter whose value is known, but estimates that known value at each iteration, and uses the difference between the estimate and the known value to obtain a larger increase in the actual likelihood. But we hadn’t done the formal math, so we withdrew the accepted paper to work on that.
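(A schematic version of the idea, in perhaps the simplest setting and as a paraphrase rather than the paper’s general formulation: in a one-way random effects model
$$y_{ij} = \mu + b_i + e_{ij}, \qquad b_i \sim N(0, \sigma_b^2), \qquad e_{ij} \sim N(0, \sigma_e^2),$$
one can embed the model in an expanded one,
$$y_{ij} = \mu + \alpha\, c_i + e_{ij}, \qquad c_i \sim N(0, \gamma^2),$$
in which the expansion parameter $\alpha$ has the known value 1 in the original model, so that $\sigma_b^2 = \alpha^2\gamma^2$. PX-EM carries out EM in the expanded model, estimating $\alpha$ at each iteration even though its value is known, and then maps back via $\sigma_b^2 = \hat{\alpha}^2\hat{\gamma}^2$; estimating the “known” $\alpha$ is what yields the larger per-iteration increase in the likelihood, in the same spirit as ANCOVA’s estimation of a covariate mean difference known to be zero.)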


Both Chuanhai and Ying Nian were fine with that decision. My memory is that we basically destroyed part of a Christmas holiday getting the idea down correctly. We were then ready to resubmit, and it was not surprising that it was re-accepted overnight, I think. Another lesson: Try to make each publication as clean as possible — you and your coauthors will have to live with the published result forever, or until someone cleans it up!

49.7.2 It’s important to withdraw to avoid having a marginal application

“Principal stratification for causal inference with extended partial compliance: Application to Efron–Feldman data.” Journal of the American Statistical Association, vol. 103 (2008), pp. 101–111 (joint work with H. Jin). [Number of citations: 65]

This paper re-analyzed a data set from an article (Efron and Feldman, 1991) on noncompliance, but I think that Hui Jin and I approached it more appropriately, using principal stratification (Frangakis and Rubin, 2002). I had had a decade to ponder the issues, the benefit of two great economist coauthors in the interim (Angrist et al., 1996), a wonderful PhD student (Constantine Frangakis) to help formulate a general framework, and a great PhD student to work on the example. The submission was accepted fairly quickly, but as it was about to go to the Copy Editors, I was having doubts about the last section, which I really liked in principle; the actual application, however, didn’t make complete scientific sense, based on my experience consulting on various pharmaceutical projects. So I wanted to withdraw the paper and to ask my coauthor, who had done all the extensive computing very skillfully, to do all sorts of new computing. Her initial reaction was something like: Had I lost my mind? Withdraw a paper already accepted at JASA?! Wasn’t the objective of writing and rewriting to get the paper accepted? But after listening to the reasons, she went along with my temporary insanity, and she executed the final analyses that made scientific sense with great skill and care. Of course, the paper was re-accepted. And it won the Mitchell Prize at the Joint Statistical Meetings in 2009 for the best Bayesian paper.

The message here is partly a repeat of the one above regarding publishing the best version that you can, but it is more relevant to junior authors anxious for publications. I surely know how difficult it can be, certainly in the early years, to build a CV and get promoted; but that’s short term. Eventually real quality will triumph, so don’t publish anything that you think may haunt you in the future, even if it’s accepted in a top journal. As Pixar’s Jay Shuster put it: “Pain is temporary, ‘suck’ is forever.” By the way, Hui Jin now has an excellent position at the International Monetary Fund.

49.7.3 It’s really important to withdraw to fix it up

“Multiple imputation by ordered monotone blocks with application to the Anthrax Vaccine Research Program.” Journal of Computational and Graphical Statistics, 2013, in press (joint work with F. Li, M. Baccini, F. Mealli, E.R. Zell, and C.E. Frangakis).

This publication hasn’t yet appeared, at least at the time of my writing this, but it emphasizes the same point, with a slightly different twist because of the multiplicity of coauthors of varying seniority. This paper grew out of a massive joint effort by many people, each doing different things on a major project. I played the role of the MI guru and organizer, and the others were absorbed with various computing, writing, and data analytic roles. Writing a document with five major actors was complicated and relatively disorganized — the latter issue, my fault. But then all of a sudden, the paper was written, submitted, and, remarkably, the first revision was accepted! I now had to read the entire thing, which had been “written” by a committee of six, only two of whom were native English speakers! Although some of the writing was good, there were parts that were confusing and other parts that appeared to be contradictory. Moreover, on closer examination, there were parts where it appeared that mistakes had been made, mistakes that would take vast amounts of time to correct fully but that affected only a small and orthogonal part of the paper. These problems were really evident only to someone who had an overview of the entire project (e.g., me), not to reviewers of the submission. I emailed my coauthors (some of whom were across the Atlantic) that I wanted to withdraw and rewrite. Initially, there seemed to be some shock (but wasn’t the purpose of writing to get things published?), but they agreed — the more senior authors essentially immediately, and the more junior ones after a bit of contemplation. The Editor who was handling this paper (Richard Levine) made the whole process as painless as possible. The revision took months to complete, and it was re-accepted over a weekend. And I’m proud of the result. Same message, in some sense, but wise Editors want to publish good things just as much as authors want to publish in top-flight journals.


49.8 Conclusion

I have been incredibly fortunate to have had access to sage advice from wonderful mentors, obviously including advice about how to react to rejected submissions. It may not always be true, and I do know of some gross examples, but in the vast majority of cases, Editors and reviewers are giving up their time to try to help authors, and, I believe, they are often especially generous and helpful to younger or inexperienced authors. Do not read personal attacks into rejection letters; such attacks are extremely rare. The reviewers may not be right, but only in rare situations, which I believe occur with submissions from more senior authors who are “doing battle” with the current Editors, is there any personal animus. As Cochran pointed out to me about 1970, they probably don’t even know anything about you, especially if you’re young. So my advice is: Quality trumps quantity, and stick with good ideas even when you have to do battle with the Editors and reviewers — they are not perfect judges, but they are, almost uniformly, on your side.

Whatever wisdom is offered by this “fireside chat” on dealing with rejections of journal submissions owes a huge debt to the advice of my mentors and very respected folks along my path. So, with the permission of the Editors of this volume, I will follow with a description of my incredible good fortune in meeting such folks. As one of the wisest folks in our field (his name is hidden among the authors of the additional references) once said to me: If you ask successful people for their advice on how to be successful, their answers are, “Be more like me.” I agree, but with the addition: “And meet wonderful people.” This statement creates a natural transition to the second part of my contribution to this 50th anniversary volume, on the importance of listening to wise mentors and sage colleagues. I actually wrote the second part before the first part, but on rereading it, I feared that it suffered from two problems: one, it sounded too self-congratulatory, and, two, almost elitist. The Editors disagreed and thought it actually could be a helpful chapter for some younger readers, perhaps because it does illustrate how good fortune plays such an important role, and I certainly have had that with respect to the wonderful influences I’ve had in my life. The advice: Take advantage of such good fortune!

References

Angrist, J.D., Imbens, G.W., and Rubin, D.B. (1996). Identification of causal effects using instrumental variables. Journal of the American Statistical Association, 91:444–472.


Anderson, T.W. (1957). Maximum likelihood estimates for a multivariate normal distribution when some observations are missing. Journal of the American Statistical Association, 52:200–203.

Cochran, W.G. (1968). The effectiveness of adjustment by subclassification in removing bias in observational studies. Biometrics, 24:295–313.

Dempster, A.P., Laird, N., and Rubin, D.B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39:1–38.

Efron, B. and Feldman, D. (1991). Compliance as an explanatory variable in clinical trials. Journal of the American Statistical Association, 86:9–17.

Frangakis, C.E. and Rubin, D.B. (2002). Principal stratification in causal inference. Biometrics, 58:21–29.

Holland, P.W. (1986). Statistics and causal inference. Journal of the American Statistical Association, 81:945–960.

Jin, H. and Rubin, D.B. (2008). Principal stratification for causal inference with extended partial compliance: Application to Efron–Feldman data. Journal of the American Statistical Association, 103:101–111.

Li, F., Baccini, M., Mealli, F., Zell, E.R., Frangakis, C.E., and Rubin, D.B. (2013). Multiple imputation by ordered monotone blocks with application to the Anthrax Vaccine Research Program. Journal of Computational and Graphical Statistics, in press.

Little, R.J.A. (2013). In praise of simplicity not mathematistry! Ten simple powerful ideas for the statistical scientist. Journal of the American Statistical Association, 108:359–369.

Little, R.J.A. and Rubin, D.B. (2002). Statistical Analysis with Missing Data, 2nd edition. Wiley, New York.

Liu, C.H., Rubin, D.B., and Wu, Y.N. (1998). Parameter expansion to accelerate EM: The PX-EM algorithm. Biometrika, 85:755–770.

Lord, F.M. (1955). Estimation of parameters from incomplete data. Journal of the American Statistical Association, 50:870–876.

Molenberghs, G. (2007). What to do with missing data? Journal of the Royal Statistical Society, Series A, 170:861–863.

Neyman, J. (1923). On the application of probability theory to agricultural experiments: Essay on principles. Section 9. Roczniki Nauk Rolniczych, 10:1–51. [English translation of the original Polish article available in Statistical Science, 5:465–472.]


Rubin, D.B. (1972). A non-iterative algorithm for least squares estimation of missing values in any analysis of variance design. Journal of the Royal Statistical Society, Series C, 21:136–141.

Rubin, D.B. (1973a). Matching to remove bias in observational studies. Biometrics, 29:159–183. [Printer’s correction note, 30, p. 728.]

Rubin, D.B. (1973b). The use of matched sampling and regression adjustment to remove bias in observational studies. Biometrics, 29:184–203.

Rubin, D.B. (1974a). Characterizing the estimation of parameters in incomplete data problems. Journal of the American Statistical Association, 69:467–474.

Rubin, D.B. (1974b). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66:688–701.

Rubin, D.B. (1976). Inference and missing data. Biometrika, 63:581–592.

Rubin, D.B. (2010). Reflections stimulated by the comments of Shadish (2009) and West & Thoemmes (2009). Psychological Methods, 15:38–46.

Wu, C.F.J. (1983). On the convergence properties of the EM algorithm. The Annals of Statistics, 11:95–103.


50
The importance of mentors

Donald B. Rubin
Department of Statistics
Harvard University, Cambridge, MA

On this 50th anniversary of the COPSS, I feel incredibly fortunate to have stumbled into the career path that I followed, which appears, even in hindsight, like an overgrown trail in the woods that somehow led to a spectacular location. In some sense, there is an odd coincidence in that it is also roughly the 50th anniversary of my recognizing the field of statistics as a valuable one. When thinking about what I could say here that might interest or help others, and is not available in other places, I eventually decided to write about my path and the importance, to me at least, of having wonderful “mentors” with different backgrounds, which allowed me to appreciate many different modes of productive thinking. Probably the characteristic of the field of statistics that makes it so appealing to me is the wonderful breadth of intellectual topics that it touches. Many of my statistical mentors had a deep appreciation for this, and for that I will always be grateful, but also I have always felt very fortunate to have had admirable mentors from other disciplines as well.

50.1 My early years

I grew up in Evanston, Illinois, home of Northwestern University. As a kid, I was heavily influenced intellectually by a collection of people from various directions. My father was one of four brothers, all of whom were lawyers, and we used to have stimulating arguments about all sorts of topics; arguing was not a hostile activity but rather an intellectually engaging one. Probably the most argumentative was Uncle Sy, from DC, who had framed personal letters of thanks for service from, eventually, all the presidents starting with Harry Truman going through Gerald Ford, as well as some contenders, such as Adlai Stevenson, and various Supreme Court Justices, and even Eleanor Roosevelt, with whom I shook hands back then. It was a daunting experience, not only because of her reputation, but also because, according to my memory, she was


twice as tall as I was! The relevance of this to the field of statistics is that it created in me a deep respect for the principles of our legal system, to which I find statistics deeply relevant, for example, concerning issues as diverse as the death penalty, affirmative action, tobacco litigation, ground water pollution, wire fraud, etc.

But the uncle who was most influential on my eventual interest in statistics was my mother’s brother, a dentist (then a bachelor), who loved to gamble small amounts, either in the bleachers at Wrigley Field, betting on the outcome of the next pitch while we watched the Cubs lose, or at Arlington Race Track, where I was taught as a wee lad how to read the Racing Form and estimate the “true” odds from the various displayed betting pools while losing two-dollar bets. Wednesday and Saturday afternoons, during the warm months then, were times to learn statistics — even if at various bookie joints that were sometimes raided. As I recall, I was a decent student of his, but I still lost small amounts — this taught me never to gamble with machines. Later, as a PhD student at Harvard, I learned never to “gamble” with “Pros.” From those days I am reminded of the W.C. Fields line: when playing poker for money on a public train car, he was chastised by an older woman, “Young man, don’t you know that gambling is a sin?” He replied, “The way I play it, it’s not gambling.” The Harvard Pros were not gambling when playing against me.

There were two other important influences on my statistical interests from the late 1950s and early 1960s. First, there was an old friend of my father’s from their government days together, a Professor Emeritus of Economics at the University of California, Berkeley, George Mehren, with whom I had many entertaining and educational (to me) arguments, which generated a respect for economics that continues to grow to this day. And second, there was my wonderful teacher of physics at Evanston Township High School — Robert Anspaugh — who tried to teach me to think like a scientist, and how to use mathematics in pursuit of science. So by the time I left high school for college, I appreciated some probabilistic thinking from gambling and some scientific thinking from physics, and I had deep respect for disciplines other than formal mathematics. These, in hindsight, are exposures that were crucial to the kind of formal statistics to which I gravitated as I matured.

50.2 The years at Princeton University

When I entered Princeton in 1961, like many kids at the time, I had a pile of advanced placements, which lined me up for a BA in three years; but, unknown to me before I entered, I was also lined up for a crazy plan to get a PhD in physics in five years, in a program being proposed by John Wheeler, a well-known professor of Physics there (and Richard Feynman’s PhD advisor years


earlier). Wheeler was a fabulous teacher, truly inspirational. Within the first week, he presented various “paradoxes” generated by special relativity, introduced the basics of quantum mechanics, gave homework problems designed to stimulate intuitive but precise thinking (e.g., estimate — by careful reasoning — how far a wild goose can fly), pointed out errors in our current text (e.g., that coherent light cannot be created — it can — lasers were invented about a year after that edition was published), and touched on many other features of scientific thinking that are critically important but often nearly absent from some people’s statistical thinking, either because they do not have the requisite mathematical background (and sometimes appear to think that algorithmic thinking is a substitute), or because they are still enamored with thoughtless mathematical manipulations, or perhaps for some other reason.

In any case, my physics lessons from Anspaugh and Wheeler were crucial to my thinking, especially the naturalness of two closely related messages. First, time marches on, and despite more recent “back to the future” movies, we cannot go back in time. This leads directly to the second message: in any scientific problem, there will always exist missing data — the “observer effect” (intuitively related to the Heisenberg uncertainty principle, but different). That is, you cannot precisely measure both position and momentum at the same point in time, say t, because the physical act of measuring one of them at t affects the other’s value after t; this is just like the fact that you cannot go back in time to give the other treatment in a causal inference problem, and the choice of notation and problem formulation should reflect these facts. All of statistics should be formulated as missing data problems (my view since about 1970, although not everyone’s).

But like many kids of that age, I was torn by competing demands about how to grow up, as well as by larger social issues of that time, such as our involvement in Vietnam. And Wheeler took a leave of absence in my second year, I think to visit Texas Austin, so I switched fields. My exact reasoning from that time is a bit fuzzy, and although I continued to take some more advanced physics courses, I switched from Physics to Psychology towards the end of my second year, where my mathematical and scientific background seemed both rare and appreciated, whereas in math and physics, at least in my cohort, both skill sets were good, especially so in physics, but not rare. This decision was an immature one (I am not sure what a mature one would have been), but it was a fine decision because it introduced me to some new ways of thinking as well as to new, fabulous academic mentors.

First, there was a wonderful Psychologist, Silvan Tomkins, author of the three-volume “Affect, Imagery, Consciousness,” who introduced me to Sigmund Freud’s work, and to other philosopher/psychologists on whose work his own book built. I was amazed that interpreting the dreams of strangers actually worked much of the time; if I asked the right questions about their dreams, I could quite often tell things about strangers, such as recondite fears or aspirations! There may really exist a “collective unconscious,” to use Jung’s phrase. In any case, I developed a new respect for psychology, including for their neat


experiments to assess creativity and such “soft” concepts — there was real scientific understanding among many psychologists, and so much yet to learn about the mind! So, armed with the view that there was much to do in that direction, in my final year I actually applied to PhD programs in psychology, and I was accepted at Stanford, the University of Michigan, and Harvard.

Stanford was the strongest technically, with a very quiet but wonderful professor who subsequently moved to Harvard, Bill Estes. Michigan had a very strong mathematical psychology program, and when I visited there in the spring of 1965, I was hosted primarily by a very promising graduating PhD student, Amos Tversky, who subsequently wrote extremely influential and Nobel Prize-winning (in economics) work with Danny Kahneman. Amos’s work, even in 1965, was obviously great stuff, but I decided on Harvard, for the wrong reason (a girlfriend on the East Coast); still, meeting Bill and Amos, and hearing the directions of their work, confirmed the idea that being in psychology was going to work out well — until I got to Harvard.

50.3 Harvard University — the early years

My start at Harvard in the Department of Social Relations, which was the home of psychology back then, was disappointing, to say the least. First, all sorts of verbal agreements, established on my visit only months before with a senior faculty member, were totally forgotten! I was told that my undergraduate education was scientifically deficient because it lacked “methods and statistics” courses, and I would have to take them at Harvard or withdraw. Because of all the math and physics that I’d had at Princeton, I was insulted! And because I had independent funding from an NSF graduate fellowship, I found what was essentially a Computer Science (CS) program, which seemed happy to have me, probably because I knew Fortran and had used it extensively at Princeton; but I also found some real math courses, and ones in CS on “mathy” topics such as computational complexity, more interesting than the CS ones, although it was clear that computers, as they were evolving, were going to change the way much of science was done.

But what to do with my academic career? The military draft was still in place, and neither Vietnam nor Canada seemed appealing. And I had picked up a Master’s degree from CS in the spring of 1966.

A summer job in Princeton in 1966 led to an interesting suggestion. I was doing some programming for John Tukey and some consulting for a Princeton Sociology Professor, Robert Althauser, basically writing programs to do matched sampling; Althauser seemed impressed by my ability to program and to do mathematics, and we discussed my future plans — he mentioned Fred Mosteller and the decade-old Statistics Department at Harvard, and he suggested that I look into it. I did, and by the fall of 1968, I was trying my third PhD


D.B. Rubin 609program at Harvard, still using the NSF funding that was again renewed. Agreat and final field change!But my years in CS were good ones too. And the background in CS wasextremely useful in Statistics — for doing my own work and for helping otherPhD students. But an aside about a lesson I learned then: When changingjobs, never admit you know anything about computing or you will never haveany time to yourself! After a couple of years of denying any knowledge aboutcomputers, no one will ask, and furthermore, by then, you will be totallyignorant about anything new and practical in the world of computing, in anycase — at least I was.50.4 My years in statistics as a PhD studentThese were great years, with superb mentoring by senior folks: Fred Mosteller,who taught me about the value of careful, precise writing and about responsibilitiesto the profession; Art Dempster, who continued the lessons aboutscientific thinking I learned earlier, by focusing his statistics on principlesrather than ad hoc procedures; and of course, Bill Cochran, a wonderfullywise and kind person with a fabulous dry sense of humor, who really taughtme what the field of statistics, at least to him, concerned. Also importantwas meeting life-long friends, such as Paul Holland, as a junior faculty member.Also, there were other faculty with whom I became life-long friends, inparticular, Bob Rosenthal, a professor in psychology — we met in a Cochranseminar on experimental design. Bob has great statistical insights, especiallyin design, but did not have the mathematical background to do any “heavylifting”in this direction, but this connection helped to preserve the long-terminterests in psychology. Bob was a mentor in many ways, but one of the mostimportant was how to be a good professor for your students — they deserveaccess to your time and mind and its accumulated wisdom.Another psychology faculty member, whom I met in the summer of 1965and greatly influenced me, was Julian Jaynes from Princeton, who became relativelyfamous for his book “The Origin of Consciousness in the Breakdownof the Bicameral Mind” — a spectacularly interesting person, with whomI became very close during my post-graduate years when I was at ETS (theEducational Testing Services) in Princeton. A bit more, shortly, on his influenceon my thinking about the importance of bridging ideas across disciples.After finishing my graduate work in 1970, I stayed around Harvard Statisticsfor one more year as a faculty member co-teaching with Bob Rosenthalthe “Statistics for Psychologists” course that, ironically, the Social RelationsDepartment wanted me to take five years earlier, thereby driving me out oftheir program! I decided after that year that being a junior faculty member,even in a great department, was not for me. So I ended up accepting a fine


610 Importance of mentorsposition at ETS in Princeton, New Jersey, where I also taught part-time atPrinceton’s young Statistics Department, which renewed my friendship withTukey; between the two positions, my annual salary was more than twice whatI could be offered at Harvard to stay there as junior faculty.50.5 The decade at ETSThe time at ETS really encouraged many of my earlier applied and theoreticalconnections — it was like an academic position with teaching responsibilitiesreplaced by consulting on ETS’s social science problems, including psychologicaland educational testing ones; and I had the academic connection atPrinceton, where for several years I taught one course a year. My ETS boss, AlBeaton, had a Harvard Doctorate in Education, and had worked with Dempsteron computational issues, such as the “sweep operator.” Al was a very niceguy with deep understanding of practical computing issues. These were greattimes for me, with tremendous freedom to pursue what I regarded as importantwork. Also in those early years I had the freedom to remain in close contactwith Cochran, Dempster, Holland, and Rosenthal, which was very importantto me and fully encouraged by Beaton. I also had a Guggenheim fellowship in1978, during which I spent a semester teaching causal inference back at Harvard.A few years before I had visited the University of California Berkeley fora semester, where I was given an office next to Jerzy Neyman, who was thenretired but very active — a great European gentleman, who clearly knew thedifference between mathematical statistics for publishing and real statisticsfor science — there is no doubt that I learned from him, not a mentor as such,but as a patient and kind scholar interested in helping younger people, evenone from ETS.Here’s where Julian Jaynes re-enters the picture in a major way. We becamevery close friends, having dinner and drinks together several times a weekat a basement restaurant/bar in Princeton called the Annex. We would havelong discussions about psychology and scientific evidence, e.g., what makesfor consciousness. His knowledge of history and of psychology was voluminous,and he, in combination with Rosenthal and the issues at ETS, certainlycemented my fascination with social science generally. A different style mentor,with a truly eye-opening view of the world.


50.6 Interim time in DC at EPA, at the University of Wisconsin, and the University of Chicago

Sometime around 1978 I was asked to be the Coordinating and Applications Editor of JASA. Stephen Stigler was then Theory and Methods Editor. I had previously served as an Associate Editor for Morrie DeGroot, and was becoming relatively better known, for things like the EM algorithm and different contributions that appeared in various statistical and social science journals, so more options were arising. I spent two weeks in December 1978 at the Environmental Protection Agency in the Senior Executive Service (very long story), but like most things that happened to me, it was very fortunate. I was in charge of a couple of statistical projects with connections to future mentors, one with a connection to Herman Chernoff (then at MIT), and one with a connection to George Box; George and I really hit it off, primarily because of his insistence on statistics having connections to real problems, but also because of his wonderful sense of humor, which was witty and ribald, and his love of good spirits.

Previously I had met David Cox, via my 1976 Biometrika paper "Inference and missing data," discussed by Cox's then PhD student and subsequently my great coauthor, Rod Little. I found that the British style of statistics fit fabulously with my own interests, and the senior British trio, Box, Cochran and Cox, were models for the kind of statistician I wanted to be. I also participated with Box in several Gordon Conferences on Statistics and Chemistry in the late 1970s and early 1980s, where George could unleash his "casual" side. Of some importance to my applied side, at one of these I met, and became good friends with, Lewis Sheiner, UCSF Pharmacology Professor. Lewis was a very wise doctor with remarkably good statistical understanding, who did a lot of consulting for FDA and for pharmaceutical companies, which opened up another connection to an applied discipline for me, in which I am still active, with folks at FDA and individuals in the pharmaceutical world.

In any case, the EPA position led to an invitation to visit Box at the Math Research Center at the University of Wisconsin, which I gladly accepted. Another great year with long-term friends and good memories. But via Steve Stigler and other University of Chicago connections, a full professor position was offered, jointly in Statistics and in the Department of Education. I was there for only two years, but another wonderful place to be with more superb mentors; David Wallace and Paul Meier, in particular, were especially helpful to me in my first full-time full professor position. I also had a connection to the National Opinion Research Corporation, which was important. It not only was the home of the first grant to support multiple imputation, but because they did real survey work, they were actually interested in my weird ideas about surveys! And because they also did work in economics, this initiated a bridge to that wonderful field that is still growing for me. Great times.


50.7 The three decades at Harvard

I'm just completing my 30th year at the Harvard Department of Statistics, and these years have been fabulous ones, too. The first of those years renewed and reinforced my collaborations with Bob Rosenthal, through our co-teaching a "statistics for psychologists" course and our Thursday "brown-bag consulting" lunch. Other psychologists there have been influential as well, such as Jerry Kagan, a wonderfully thoughtful guy with a fabulous sense of humor, who was a great mentor regarding personality theory, as was Phil Holzman with his focus on schizophrenia. We would all meet at Bill and Kay Estes's spectacular Christmas parties at their "William James" house, with notable guests such as Julia Child, who lived down the block and reminded me of Eleanor Roosevelt. These personal connections to deep-thinking psychologists clearly affect the way I approach problems.

These early years as Professor at Harvard also saw a real attempt to create something of a bridge to economics in Cambridge, initially through some 1990s efforts with Bob Solow and then Josh Angrist, both at MIT, and of course my close colleague Guido Imbens, now at Stanford, and then again with Guido more recently in the context of our causal book and our co-taught course. Also, economist Eric Maskin, who recently returned to Harvard after a stint in Princeton, convinced me to teach a "baby causal" course in the "core" for undergraduates who had no background in anything technical — it was good for me and my teaching fellows, and hopefully some of those who took the course. Another economist who influenced me was the Dean who "hired me" — Henry Rosovsky — one of the wisest and most down-to-earth men I have ever met; we shared many common interests, such as classic cars and good lunches. A wonderful mentor about academic life! Every academic should read his book: "The University: An Owner's Manual."

And of course there were the senior folks in Statistics: Art Dempster, with his principled approach to statistics, was always a pleasure to observe; Fred Mosteller and his push for collaborations and clear writing; and Herman Chernoff (whom I hired; he used to refer to me as his "boss" — hmm, despite my being over 20 years his junior). Herman attended and still, at 90 years, attends most of our seminars and offers penetrating comments — a fabulous colleague with a fabulous mind and subtle and clever sense of humor. And old friend Carl Morris — always a great colleague.

50.8 Conclusions

I have intentionally focused on mentors of mine who were (or are) older, despite the undeniable fact that I have learned tremendous amounts from my colleagues and students. I also apologize for any mentors whom I have accidentally omitted — I'm sure that there are some. But to all, thanks so much for the guidance and advice that have led me to being a statistician with a variety of interests. My career would have told a very different story if I had not had all the wonderful guidance that I have received. I would have probably ended up in some swamp off that tangled path in the woods. I think that my realizing that fact has greatly contributed to my own desire to help guide my own students and younger colleagues. I hope that I continue to be blessed with mentors, students and colleagues like the ones I've had in the past, until we all celebrate the 100th anniversary of the COPSS!


51
Never ask for or give advice, make mistakes, accept mediocrity, enthuse

Terry Speed
Division of Bioinformatics
Walter and Eliza Hall Institute of Medical Research
and
Department of Statistics
University of California, Berkeley, CA

Yes, that's my advice to statisticians. Especially you, dear reader under 40, for you are one of the people most likely to ask an older statistician for advice. But also you, dear reader over 40, for you are one of the people to whom younger statisticians are most likely to turn for advice.

Why 40? In the 1960s, which I lived through, the mantra was Never trust anyone over 30. Times change, and now 40 is (approximately) the cut-off for the COPSS Presidents' Award, so I think it's a reasonable dividing line for separating advisors and advisees. Of course people can and do give and take advice at any age, but I think we regard advice from peers very differently from advice from... advisors. That's what I'm advising against. Please don't get me wrong: I'm not being ageist here, at least not consciously. I'm being a splitter.

Where am I going with all this? There is a sentence that used to be heard a lot on TV shows, both seriously and in jest: "Don't try this at home." It was usually said after showing a stupid or dangerous act, and was a way of disclaiming liability, as they knew it wouldn't work well for most viewers. I often feel that people who give advice should act similarly, ending their advice with "Don't take my advice!"

51.1 Never ask for or give advice

What's wrong with advice? For a start, people giving advice lie. That they do so with the best intentions doesn't alter this fact. This point has been summarized nicely by Radhika Nagpal (2013). I say trust the people who tell you "I have no idea what I'd do in a comparable situation. Perhaps toss a coin." Of course people don't say that, they tell you what they'd like to do or wish they had done in some comparable situation. You can hope for better.

What do statisticians do when we have to choose between treatments A and B, where there is genuine uncertainty within the expert community about the preferred treatment? Do we look for a statistician over 40 and ask them which treatment we should choose? We don't, we recommend running a randomized experiment, ideally a double-blind one, and we hope to achieve a high adherence to the assigned treatment from our subjects. So, if you really don't know what to do, forget advice, just toss a coin, and do exactly what it tells you. But you are an experiment with n = 1, you protest. Precisely. What do you prefer with n = 1: an observational study or a randomized trial? (It's a pity the experiment can't be singly, much less doubly, blinded.)

You may wonder whether a randomized trial is justified in your circumstances. That's a very important point. Is it true that there is genuine uncertainty within the expert community (i.e., you) about the preferred course of action? If not, then choosing at random between your two options is not only unethical, it's stupid. And who decides whether or not there is genuine uncertainty in your mind: you or the people to whom you might turn for advice? This brings me to the most valuable role potential advisors can play for potential advisees, the one I offer when people ask me for advice. I reply "I don't give advice, but I'm very happy to listen and talk. Let's begin." This role cannot be replaced by words in a book like this, or on a website.

51.2 Make mistakes

What if it turns out that you made a wrong decision? I'll pass over the important question of how you learned that it was the wrong decision, of how you tell that the other decision would have been better. That would take me into the world of counterfactuals and causal inference, and I've reserved my next lifetime for a close study of that topic. But let's suppose you really did make a mistake: is that so bad?

There is a modest literature on the virtues of making mistakes, and I like to refer people to it as often as possible. Why? Because I find that too many people in our business — especially young people — seem to be unduly risk averse. It's fine not wanting to lose your money in a casino (though winning has a certain appeal), but always choosing the safe course throughout a career seems sad to me. I think there's a lot to be gained from a modest amount of risk-taking, especially when that means doing something you would like to do, and not what your advisor or department chair or favorite COPSS award winner thinks you should do. However, to call it literature might be overly generous. Perhaps a better description is a body of platitudes, slogans and epigrams by pundits and leaders from business, sport, and the arts. One thing is certain: professors do not figure prominently in this "literature." Nothing ventured, nothing gained catches the recurrent theme. The playwright George Bernard Shaw wrote: A life spent making mistakes is not only more honorable, but more useful than a life spent doing nothing, and many others have echoed his words. Another Irish playwright, Oscar Wilde, was more forthright: Most people die of a sort of creeping common sense, and discover when it is too late that the only things one never regrets are one's mistakes.

Then there is the view of mistakes as essential for learning. That is nowhere better illustrated than in the process of learning to be a surgeon. Everyone should read the chapter in the book by Atul Gawande (Gawande, 2002) entitled When Doctors Make Mistakes. Again, my "mistake literature" is clear on this. Oscar Wilde once more: Experience is the name everyone gives to their mistakes. As statisticians, we rarely get to bury our mistakes, so let's all make a few more!

51.3 Accept mediocrity

What's so good about mediocrity? Well, it applies to most of us. Remember the bell curve? Where is it highest? Also, when we condition upon something, we regress towards "mediocrity," the term chosen by Galton (Galton, 1886). Let's learn to love it. When I was younger I read biographies (Gauss, Abel, Kovalevskaya, von Neumann, Turing, Ramanujan, ...) and autobiographies (Wiener, Hardy, Russell, ...) of famous mathematicians. I found them all inspiring, interesting and informative, but light-years from me, for they were all great mathematicians, whereas I was a very mediocre one.

At the time I thought I might one day write Memoirs of a Mediocre Mathematician, to encourage others like myself, people near the mode of the curve. However, I didn't stay a mathematician long enough for this project to get off the ground. Mediocre indeed. Later I considered writing Stories from a Second-Rate Statistician, but rejected that as too immodest. Perhaps Tales from a Third-Rate Theorist, or Confessions of a C-Grade Calculator, or Diary of a D-Grade Data Analyst, or News from an Nth-Rate Number Cruncher? You can see my goal: to have biographical material which can both inspire, interest and inform, but at the same time encourage, not discourage, young statisticians. To tell my readers: I do not live on another planet, I'm like you, both feet firmly planted on Planet Earth. Maybe my goal is mistaken (see above), but I do remember enjoying reading The Diary of a Nobody many years ago. You see, I do not believe we can all be whatever we want to be, that all that matters is that we want to be something or be someone, and that if we want it enough, we can achieve it. Without wishing to discourage younger readers who have yet to notice, our ability to make ourselves faster, higher and stronger, not to mention smarter and deeper, is rather limited. Yes, it would be great to win a Gold Medal at the Olympics, or a Nobel Prize or a Fields Medal, or even to win the COPSS Presidents' Award, but my view is that for all but a vanishingly small number of us, such goals are unachievable, no matter how much time and effort we put in. This is not to say that the people who do achieve these goals can do so without expending considerable time and effort, for there is plenty of literature (no quotes now) suggesting that they must. My point is that time and effort are usually not sufficient to bring about dramatic changes in us, that we are what we are.

Also when I was young, I read a statement along the following lines: Worldly acclaim is the hallmark of mediocrity. I don't remember where I saw it, and can't find it now, but I liked it, and it has stuck in my head. I used to think of it every time I saw someone else get a prize or receive some other kind of acclaim. I would think "Don't feel too bad, Terry. Galois, Van Gogh, Mozart, Harrison, Mendel and many others all had to wait until after they died to be acclaimed geniuses; your time will come." Of course I always knew that I wasn't in the same class as these geniuses, and, as if to prove that, I came in due course to win some awards. But it still sticks in my mind: that true recognition is what comes after we die, and we shouldn't be too concerned with what comes in our lifetime. I think we'd all be better off accepting what we are, and trying to be a better one of those, than trying to achieve the unachievable. If that means accepting mediocrity, so be it, but then let's aim to be the best — fill in your name — on the planet. Let's be ourselves first and foremost. I think being happy with what we are, while working to make realistic improvements, is a great start to achieving more than we might initially think we can achieve. Unfortunately I can't leave this theme without pointing out that our profession is multidimensional, not one-dimensional, so it is likely that the concept of "best" doesn't even make sense here. We don't have competitions and rankings like chess or tennis players; we try to bring all our skill and experience to bear on any given statistical problem, in the hope that we can find a good answer. But we never have to say we're certain. On the other hand, there may well be a dimension along which you can be the best.

51.4 Enthuse

Why enthuse? Enjoyment of our job is one of the things that distinguish people like us — teachers, researchers, scholars — from the majority of our fellow human beings. We can find our work engaging, challenging, stimulating, rewarding, and fulfilling. It can provide opportunities for expressing our creative sides, for harnessing our competitive urges, for exhibiting our altruistic spirits, and for finding enjoyment in working with others and in solitary pursuits. If we accept all this, then surely we have very good reason to be enthusiastic about what we do. Why not show it?

Not so long ago, I would feel slightly deflated when people complimented me on my enthusiasm for what I do. I would think "Why are they telling me they like my enthusiasm? Why aren't they telling me how much they admire that incredibly original, deep and useful research I was expounding? Is my work no good, is all they see a crazy person waving his arms around wildly?" I've since got over worrying about that, and these days feel very happy if I am able to convey my enthusiasm to others, especially if I can make them smile or laugh at the same time. It's not hard to get a laugh with a weak joke, but I prefer to do it using a mix of slightly comic enthusiasm, and irony. My experience now is that a strong show of enthusiasm sets the stage for laughter, which I think is great. Perhaps I'm drifting from a would-be scholar to a would-be entertainer, but we all have it so good, I think we can afford to share our joy with others. Nowadays I'd rather be remembered as a person who made others laugh in his lectures, than one who impressed everyone with his scholarship.

My summary paraphrases the song popularized by Frank Sinatra (Sinatra, 1969). Read and enjoy all the contributions in this book, but "Do it your way."

Acknowledgments

Many thanks to Xihong Lin, Jane-Ling Wang, and Bin Yu for their supportive and helpful comments on earlier drafts of this small essay.

References

Galton, F. (1886). Regression towards mediocrity in hereditary stature. The Journal of the Anthropological Institute of Great Britain and Ireland, 15:246–263.

Gawande, A. (2002). Complications: A Surgeon's Notes on an Imperfect Science. Picador, New York.

Nagpal, R. (2013). The awesomest 7-year postdoc or: How I learned to stop worrying and learned to love the tenure-track faculty life. Scientific American Guest Blog, July 21, 2013.

Sinatra, F. (1969). My Way. http://www.youtube.com/watch?v=1t8kAbUg4t4


52
Thirteen rules

Bradley Efron
Department of Statistics
Stanford University, Stanford, CA

52.1 Introduction

When I was five or six my father paraded me around the neighborhood as a mental marvel able to multiply three-digit numbers. I think he enjoyed it more than I did (my savant powers are seriously limited) but it did give my little boat a first push into the big river of statistics.

So, after all these years, am I grateful for the push? Oh yes (thanks Dad!). Statistics is a uniquely fascinating intellectual discipline, poised uneasily as it is at the triple point of mathematics, philosophy, and science. The field has been growing slowly but steadily in influence for a hundred years, with an increased upward slope during the past few decades. "Buy stat futures" would be my advice to ambitious deans and provosts.

At this point I was supposed to come across with some serious advice about the statistical life and how to live it. But a look at some of the other volume entries made it clear that the advice quota was being well met. (I particularly enjoyed Hall, Rubin, and Reid's pieces.) Instead, let me offer some hard-earned rules garnered from listening to thousands of scholarly presentations.

52.2 Thirteen rules for giving a really bad talk

1. Don't plan too carefully, "improv" is the name of the game with technical talks.

2. Begin by thanking an enormous number of people, including blurry little pictures if possible. It comes across as humility.

3. Waste a lot of time at first on some small point, like the correct spelling of "Chebychev." Who ever heard of running out of time? (See Rule 13.)

4. An elaborate outline of the talk to come, phrased in terms the audience hasn't heard yet, really sets the stage, and saves saying "I'm going to present the beginning, the middle, and the end."

5. Don't give away your simple motivating example early on. That's like stepping on your own punchline.

6. A good way to start is with the most general, abstract statement possible.

7. The best notation is the most complete notation — don't skimp on those subscripts!

8. Blank space on the screen is wasted space. There should be an icon for everything — if you say the word "apple," an apple should tumble in from the right, etc. And don't forget to read every word on the screen out loud.

9. Humans are incredibly good at reading tables, so the more rows and columns the better. Statements like "you probably can't make out these numbers but they are pretty much what I said" are audience confidence builders.

10. Don't speak too clearly. It isn't necessary for those in the front row.

11. Go back and forth rapidly between your slides. That's what God made computers for.

12. Try to get across everything you've learned in the past year in the few minutes allotted. These are college grads, right?

13. Oh my, you are running out of time. Don't skip anything, show every slide even if it's just for a millisecond. Saying "This is really interesting stuff, I wish I had time for it" will make people grateful for getting "Chebychev" right.
