Models for Integrating Statistics in Biology Education

Laura Kubatko, The Ohio State University
Danny Kaplan, Macalester College
Jeff Knisley, East Tennessee State University

Models for Integrating Statistics in Biology Education: The Ohio State University

The Ohio State University
• Approximately 38,000 undergraduates on main campus in Columbus, OH
• Six biology departments, offering eight distinct majors
  – 2,300 majors in biological sciences
• Also undergraduate programs in medical fields, environmental sciences, etc.
• Variability in mathematical and statistical requirements across majors

The Ohio State University
• Growing presence in mathematical biology
  – NSF-funded Mathematical Biosciences Institute
  – Associated faculty hires, joint appointments, etc.
  – Degree programs under development
    • M.S. in Mathematical Biology
    • Track with undergrad Math Major for Mathematical Biology
  – NSF-funded UBM Program (2008-2013)

Development of Curriculum in Mathematical Biology at OSU
• At the request of the College of Biological Sciences
• Four courses:
  – Calculus for the Life Sciences I and II
  – Statistics for the Life Sciences
  – Mathematical Modeling
• Student population: Freshmen life science majors who place into calculus

Development of Curriculum in Mathematical Biology at OSU
• Goal: All biology majors take calculus; Statistics and Modeling are optional
• Considerations in designing statistics course:
  – Build on calculus sequence
  – Satisfy requirements for statistics courses in majors
  – Include analysis of actual data sets
  – Introduce computing

Statistics for the Life Sciences
• Three 48-minute lectures per week
  – Traditional lecture format (with some activities)
• Two 48-minute labs in computer room (taught by GTA)
  – Half lab activities, half problem-solving sessions
  – StatCrunch software used for data analysis
    • Advantages: Runs in Java, easy to use in a 10-week course
    • Disadvantage: Limited availability to students after the course

Statistics for the Life Sciences
• Four data sets integrated in lecture and lab throughout the quarter:
  – Magnetic field data (Barnothy, 1964)
  – Fisher’s iris data set (Fisher, 1936)
  – Limnology data (collected in 1993 at the James H. Barrow Field Station)
  – Forest composition data (collected in 1993 at the James H. Barrow Field Station)

Statistics for the Life Sciences
• Overview of Topics:
  – Descriptive statistics, graphical displays (1 week)
  – Probability, including Bayes’ Theorem (1 week)
  – Discrete distributions, analyzing categorical data (2.5 weeks)
  – One- and two-sample inference for means and variances (2.5 weeks)
  – Experimental design (1 week)
  – Correlation and regression (1.5 weeks)

Successes
• Student feedback: course very useful
• Use of GTAs in all of the courses: assist in training interdisciplinary teachers
• MBI post-docs have been involved in calculus projects
• Several students recruited into our UBM program

Challenges
• Enrollment!!
  – Not required for any students at present
  – Shifts in administrative structure in College of Biological Sciences
  – Decreasing enrollment in Calculus for Life Sciences (only one section next year)
  – Students have full schedules in their first year

Challenges
• Experience of students
  – Freshmen: may only have one or two courses in biology, and often none in genetics
• Selection of topics
  – 10-week course
  – Balance coverage of fundamental ideas with more current topics

The Future
• As OSU converts to semesters, work to have these courses included formally in the appropriate places in biology majors
• Work more closely with Center for Life Science Education to understand how to integrate this course better with other experiences of these students
• Enhance lab activities
• Possibly use R for data analysis

The Future
• More broadly, the math-bio curriculum at OSU continues to grow
• UBM program has recently hired its first group of 9 Undergraduate Research Fellows
• New course: Undergraduate Seminar in Mathematical Biology
• New majors/minors will soon be available

More Information
• Syllabus, Lab Material available at http://www.stat.osu.edu/~lkubatko/CAUSEwebinar/

Models for Integrating Statistics in Biology Education: Macalester College’s Program

The Revolution in the Biosciences
The biological and medical sciences have changed dramatically in the last half century.
• The dominance of molecular biology and genetics. Example: the sequencing of whole genomes.
• Dramatic improvements in instrumentation and techniques. Example: DNA microarrays.
• The emergence of the clinical trial, large cohort studies, and “scientific medicine.” Example: The Framingham Heart Study, ongoing since 1948.
Biology used to be a haven for non-quantitative students with an interest in science. Now it is data-intensive: the data are large and multivariate.

What Statistics Do We Teach?
The typical statistics course required by a biology department is:
• Single-variable. There is a treatment group and a control group that are alike in every other way.
• Emphasizes small data sets. n = 3 is pretty common, perhaps reaching up to n = 20.
• Warns about “lurking” or “confounding” variables, but offers no way to deal with them except randomization. We do t-tests and one-way ANOVA, not multiple regression.
• Has no university-level mathematics pre-requisite.

Example: DNA Microarrays
An array of thousands to tens of thousands of small dots of different features (DNA oligonucleotides) that can probe which genes are being expressed at a given time.
From the Wikipedia article http://en.wikipedia.org/wiki/DNA_microarray

DNA Microarrays: Statistics
“A basic difference between microarray data analysis and much traditional biomedical research is the dimensionality of the data. A large clinical study might collect 100 data items per patient for thousands of patients. A medium-size microarray study will obtain many thousands of numbers per sample for perhaps a hundred samples. Many analysis techniques treat each sample as a single point in a space with thousands of dimensions, then attempt by various techniques to reduce the dimensionality of the data to something humans can visualize.”
“Experimenters must account for multiple comparisons: even if the statistical P-value assigned to a gene indicates that it is extremely unlikely that differential expression of this gene was due to random rather than treatment effects, the very high number of genes on an array makes it likely that differential expression of some genes represent false positives or false negatives.”
From the Wikipedia article http://en.wikipedia.org/wiki/DNA_microarray

What do Medical Residents Know about Statistics?
From D. M. Windish, S. J. Huot, M. L. Green (2007) “Medical Resident's Understanding of the Biostatistics and Results in the Medical Literature,” JAMA 298 (9): 1010-1017
Questions & Responses
• If we don’t teach biology students about multiple variables and the complications that arise from them, where are they going to learn about this?
• Biology students are NOT strong enough mathematically to handle multivariate material. So why do we think they are going to learn it on their own?
• You have to learn the basics first. Crawl before walking. Walk before running. If the plan is for students to take a “second course” in statistics, is there any evidence that this plan is working?

Assumptions We Made in Revising our Introductory Quantitative Curriculum
• We would have only two semester courses in which to provide material that students can use to study biology in a sophisticated way.
• It was our job to figure out how to make the material accessible to the students we have.
• The technical skills to work with multivariate data are important.
• We want students to have a good theoretical understanding of the material, not just technical skills.
• Our courses would be suitable for students preparing to take Calc I. No requirement for previous work in calculus.

Our Goals
• Common foundation for all students, more or less regardless of their earlier preparation. (Students who are ready to take Calc III are the exception; they do that, although some opt to take Applied Calculus.)
• Provide skills and concepts that are directly and concretely relevant to the follow-up courses students will take in other areas, e.g., biology, economics, ... NOT “this teaches them to think rigorously”
• Add value to the student’s existing mathematical knowledge. Not so important to refine that existing knowledge (e.g., learn how to do symbolic integrals) but to EXTEND it in ways that the student would not be able to do on his or her own. (Why do we think that students can learn multivariate stuff on their own?)

The Constraints We Faced
• Students come from different entry points.
• Students have to be prepared to do calculus-based physics.
• Pre-meds have to have a calculus course. (But there are good reasons to make it calculus, too.)
• Statistics had to be accessible to mid-level mathematics students as a stand-alone course.
• Students cannot be channelled into a special section or a special course: they typically don’t know their major when they enter.

Broad View
• In order to teach students about multivariate statistics, we need them to know something about multivariate functions. So to teach statistics, we also had to teach calculus in a manner that would be useful for statistics.
  – What a linear approximation looks like.
  – What a quadratic approximation looks like. (Including interactions.)
  – What a partial derivative is.
• The program had to be organized as two distinct, stand-alone courses: one in calculus and one in statistics.
  – Some programs require a calculus course, and many students and parents expect a calculus course, so one of the courses would be calculus. This does NOT mean it has to be about the chain rule, the quotient rule, etc.
  – Some programs require a statistics course, and many students come in with some calculus already, so the statistics course had to be accessible to them.

Broad View (cont.)
• Macalester is small (1800 students), and students don’t necessarily know their major when they start. So it wouldn’t work to have specialized courses just for biology majors. The new courses had to be suitable for the mainstream student.
• We wanted biology students (and others) to get a reasonable introduction to computation. This includes the organization of data and a familiarity with the structure of computer commands.

Calculus, Mathematics, and Statistics
• Calculus and statistics are taught as if they have little in common.
• There are actually very strong connections in terms of modeling and the interpretation of statistical models.
• The problem is that students don’t have a language for talking about modeling, change and difference.
  So the statistics course is forced to focus on very simple descriptions, e.g., are these group means different?
• Why?
  – Calculus topics were almost entirely established BEFORE 1900. Statistics starts AFTER 1900.
  – Mathematicians usually have no training in statistics whatsoever.
The way calculus is taught should change in order to support statistics.

Comment on Calculus and Statistics
There is a strong link between calculus and statistics, but many people assume that it is about:
  – Integrating probability densities
  – Using derivatives to optimize: e.g. finding the least squares fit.
Neither of these is particularly important. Students can understand areas without calculus. Least squares can be completely explained without derivatives.
The links that matter are:
  – Approximating relationships with functions (esp. linear and quadratic functions)
  – Describing rates of change: how one variable changes with another
  – The idea of partial change: the consequences of changing one variable while holding others constant.
  – Ideas of linear combinations: subspaces, projections, collinearity, redundancy.

Before Bio2010 ... there was CRAFTY
• Mathematical Association of America project on “Curriculum Renewal Across the First Two Years”
• A dozen CRAFTY workshops in 1999–2001 covered a broad range of STEM fields: biology, chemistry, computer science, engineering in various flavors, mathematics, statistics, physics.
• The conclusions reached are remarkably consistent across all disciplines (and, broadly, with Bio2010).

CRAFTY Recommendations
CRAFTY calls for much greater emphasis on ...
• Mathematical modeling, the process of constructing a representation of an object, system, or process that can be manipulated using mathematical operations.
• Statistics and data analysis.
• Multivariate topics. The reports refer specifically to two- and three-dimensional topics.
  Many of the topics mentioned are related to the traditional calculus sequence (including linear algebra, differential equations, and multivariable calculus); we’ll refer to these topics as “calculus.”
• The appropriate use of computers.

Bio 2010: Transforming Undergraduate Education for Future Research Biologists
National Research Council (U.S.). Committee on Undergraduate Biology Education to Prepare Research Scientists for the 21st Century
National Academies Press, 2003

Bio2010 & Mathematics/CS
RECOMMENDATION #1.5 Quantitative analysis, modeling, and prediction play increasingly significant day-to-day roles in today’s biomedical research. To prepare for this sea change in activities, biology majors headed for research careers need to be educated in a more quantitative manner .... The committee recommends that all biology majors master the concepts listed below. (Bio2010, pp. 41-46)
Topics are organized by
• Calculus
• Linear Algebra
• Dynamical Systems
• Probability and Statistics
• Information and Computation
• Data Structures
See appendix for a detailed list

Commentary on BIO2010
The recommendations are certainly ambitious and laudable, but…
• The recommendations seem to have been formed without any time-budget constraint.
• Some of them are vague and there is no prioritization of them. Examples: “the integral”, “integration over multiple variables.” Does this mean the concept of accumulation, or rules for symbolic integration?
• The statistics topics are out of line with current thought on “statistical literacy” and “statistical thinking.”
• To follow them with the courses currently available at most schools, every biology major would have to major in mathematics as well.
• Even though it might be impractical to cover all of the Bio2010 quantitative topics, a good majority can be covered in a coherently organized two-course sequence.

Outline of our “Applied Calculus” Course
One-semester course.
Pre-requisite to “Statistical Modeling.”
1. Modeling basics
2. Derivatives and change
3. Differential equations (emphasis on phenomena: growth, stability, oscillation)
4. Linear algebra (emphasis on geometry as it applies to statistics)
See slides in the appendix

Introduction to Statistical Modeling
• Organization and (simple) descriptions of data.
• Construction of (linear) statistical models. This includes multiple variables and nonlinear terms, esp. interactions.
• Adjustment for covariation. The idea of “partial change.”
• Inference:
  – Confidence intervals and the effects of collinearity.
  – Analysis of Covariance. Central question: Does this variable contribute to the explanation?
  – Resampling and bootstrapping.
• Causation & Experimental design: Randomization, blocking, and orthogonality.
• Logistic regression and non-parametrics.
For the preface and outline, see http://www.macalester.edu/~kaplan/ISM

Example of a Case Study: Nitrogen Fixing by Plants
Macalester biologist Mike Anderson studies the ecology of nitrogen fixing bacteria. Students are given the data he collected in field studies of alder bushes in Alaska.
• Measured nitrogen fixation.
• The genotype ID of the bacteria on each plant’s roots.
• The characteristics of the site: e.g., soil temperatures at 1cm and 5cm, water content in soil.
• The time in the season when the data were taken.

Case Study: Nitrogen Fixing (continued)
The analysis involves modeling nitrogen fixation by these other explanatory variables, taking into account the highly non-normal distribution of the nitrogen fixation, and the strong collinearity among the explanatory variables.
• Naive models indicate strongly that fixation varies among genotypes (p < 0.001), one-way ANOVA.
• Using analysis of covariance, the p-value is reduced even further (p < 0.0001). However, ...
• The association with genotype is completely captured by the covariates of site characteristics, especially when non-parametric techniques are used.
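
The pattern in this case study, a strong naive genotype effect that is absorbed once site covariates are included, can be sketched on synthetic data. The following is a minimal illustration in Python (the courses themselves use R), not Anderson's data or analysis: the three genotypes, the soil temperatures, and the helper names `solve` and `r_squared` are all hypothetical. Fixation is generated from temperature alone, yet genotype looks important because each genotype sits at a different temperature.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def r_squared(columns, y):
    """Least-squares fit of y on the given columns plus an intercept; returns R^2."""
    rows = [[1.0] + list(vals) for vals in zip(*columns)]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


# Synthetic, hypothetical data: each genotype lives at sites with a different
# typical soil temperature, and fixation depends on temperature only.
random.seed(0)
site_temp = {"A": 5.0, "B": 10.0, "C": 15.0}
genotype, temp, fixation = [], [], []
for g, base in site_temp.items():
    for _ in range(20):
        t = base + random.uniform(-2.0, 2.0)
        genotype.append(g)
        temp.append(t)
        fixation.append(2.0 * t + random.gauss(0.0, 1.0))

# Indicator (dummy) variables for genotype, with A as the reference level.
is_b = [1.0 if g == "B" else 0.0 for g in genotype]
is_c = [1.0 if g == "C" else 0.0 for g in genotype]

r2_genotype = r_squared([is_b, is_c], fixation)    # naive model: genotype only
r2_temp = r_squared([temp], fixation)              # covariate only
r2_full = r_squared([temp, is_b, is_c], fixation)  # covariate plus genotype
gain = r2_full - r2_temp                           # what genotype adds after adjustment

print("R^2, genotype only:", round(r2_genotype, 3))
print("R^2, temperature only:", round(r2_temp, 3))
print("R^2 gained by adding genotype:", round(gain, 4))
```

Comparing nested models by the gain in R² is a simplification of the analysis of covariance described above; a fuller analysis would attach an F-test or a resampling-based p-value to that gain.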
Approach in Both Courses
• Multivariate from the beginning. Lets us treat F = ma seriously, but also look at interesting biology models, e.g., predator-prey, nerve-cell, SEIR, damped harmonic oscillator, …
• De-emphasis on algebraic manipulation. Geometry used: Contours, gradients, directional derivatives, subspaces, ...
• Computation integrated into both courses. We use R, a statistics package.
• Simulations, e.g.,
  – Motion in the phase plane.
  – Hypothetical causal networks

Some Successes of Our Program
• The courses genuinely cover many of the Bio2010 topics.
• The courses have been popular with both students and faculty.
  – Fully one-third of the student body at Macalester takes Applied Calculus.
  – One-quarter takes Introduction to Statistical Modeling.
• Math/Statistics faculty enjoy teaching the courses.
• They have become the mainstream courses and are taught by multiple faculty in multiple sections each semester.

Some Failures of our Program
• The topics, skills, and techniques haven’t been picked up in the downstream biology courses.
• We still don’t offer an easy route to a reasonable education in computing. We think we would need to have a three-course sequence in order to do this well.

Toward the Future
• Introduction to Statistical Modeling
  – A textbook, exercises, class activities, etc. are available now in draft form and will be published this summer.
  – Workshops on ISM at the US Conference on Teaching Statistics (Columbus, OH, June 23-25, 2009) and the Joint Mathematics Meetings (San Francisco, January 2010)
  See www.macalester.edu/~kaplan/ISM
• An NSF CCLI Phase 2 proposal: Building a Community around Modeling, Statistics, Computation, and Calculus. See www.macalester.edu/~kaplan/MSCC
• The plan is to provide support for faculty who want to develop materials and who want to adopt materials that unify modeling, statistics, computation, and calculus in the quantitative curriculum.

Thanks to ...
• W.M. Keck Foundation for their support of Introduction to Statistical Modeling through the Keck Data Fluency project grant.
• The Howard Hughes Medical Institute, which funded the first three years of the project: the original “Calculus with Biological Applications” and “Statistics with Biological Applications.”
• The Macalester biology department, esp. Jan Serie, who sponsored the original project and agreed to require their students to take these courses even before they were fully developed.
• Other Macalester faculty involved in teaching and developing these courses: Tom Halverson, Karen Saxe, Dan Flath, David Bressoud (current president of the Mathematical Association of America), Victor Addona, Chad Topaz, Andrew Beveridge.

APPENDICES
• See www.macalester.edu/~kaplan/ISM/CauseMay 2009.pdf

Models for Integrating Statistics in Biology Education: The Symbiosis Project
East Tennessee State University

Symbiosis: An Introductory Integrated Mathematics and Biology Curriculum for the 21st Century (HHMI 52005872)
• Team-taught by Biologists (6), Mathematicians (3), and Statisticians (1)
  – Biologists progress to needs for analyses, models, or related concepts (e.g., optimization)
  – A complete intro stats and calculus curriculum via the needs and contexts provided by the biologists
  (presentation is primarily about our experiences working with our biologists)

Goals of the Symbiosis Project
• Implement a large subset of the recommendations of the BIO2010 report in an introductory lab science sequence
  – Semester 1: Statistics + Precalculus, Limits, Continuity
  – Semester 2: Completion of a Calculus I course + Statistics
  (Our focus on Semesters 1 and 2)
  – Semester 3: Modeling, BioInformatics, reinforcement of previous ideas, More Statistics

Goals of the Symbiosis Project
• Use Biological contexts to motivate mathematical and statistical concepts and tools
  – Analysis of data used to inform and interpret
  – Models and inference used to predict and explain
• Use Mathematical concepts and Statistical Inference to produce biological insights
  – Insights often need to be quantified if only to predict the scale on which the insight is valid
  – Especially useful are insights that cannot be obtained without resorting to mathematics or statistics

Table of Contents
• Symbiosis I and II
  – List of “modules” with topics selected by biologists
  – Mathematical and Statistical Highlights included
  (Not enough time to explore Symbiosis III)
• Logistics: 5 + 1 format, student populations between 7 and 30, and 3 or 4 faculty per course

Symbiosis I
1. The Scientific Method: Numbers, models, binomial, Randomization Test, Intro to Statistical Inference
2. The Cell: Descriptive Statistics and Correlation
3. Size and Scale: Lines, power laws, fractals, Poisson, exponentials, logarithms, and linear regression
4. Mendelian Genetics: Chi-Square, Normal, Goodness of Fit Test, Test of Independence
5. DNA: Conditional Probability, the Markov Property, Sampling distributions
6. Proteins and Evolution: Limits, continuity, approximations, and the t-test

Symbiosis II
7. Population Ecology: Derivatives, Rates of Change, Power, Product, Quotient rules, Differential Equations
8. Species-Species Interactions: Chain rule, Properties of the Derivative, Differential Equations Qualitatively, Equilibria, Parameter Estimation
9. Behavioral Ecology: Optimization, curve-sketching, L’Hôpital’s rule
10. Chronobiology: Trigonometric functions and their derivatives, Periodograms
11. Integration and Plant Growth: Antiderivatives, Definite Integrals, and the Fundamental Theorem
12. Energy and Enzymes: Applications of the Integral, differential equations methods, Nonlinear Regression

Major Outcomes
• Complete and/or Comprehensive Biological Investigations
  – Traditional Bio Curriculum: Biological questions pursued to a point short of quantitative analysis
  – Symbiosis: Data and Models used to explore biological questions and predict answers
    • Mendelian genetics via chi-square analysis of data
    • r- and K-strategists based on the logistic model and its solutions, including N(t) = K as an equilibrium solution

Aspects of Integration
• Biologists need or can use almost all the math and stats we can provide
  – But their goals are radically different
    • Statistical inference as a tool for justifying classification of organisms into different categories
    • Models as a means of separating different phenomena
  – And the results are used to address their (often non-quantitative) questions
    • E.g.: Simple epidemiological models used to suggest whether or not mosquitoes can carry the AIDS virus

Aspects of Integration
• Statisticians and Mathematicians can contribute to biology in a variety of ways
  – But transparency is paramount
    • Examples of techniques “Transparent” to our biologists: The Randomization test, Chi-square, Periodograms, Nonlinear Regression, phase-plane analysis
  – Or time/effort must be devoted to importance of subtleties within biological contexts
    • Example: Logarithms and exponentials with base e. (Why not just use base 10 for everything?)

Observation
• The issues preventing “downstream” usage of math and stats by biologists and their students
  – Start as small issues at the most elementary levels
    • Nearly all of Symbiosis module 1 addresses the difference between a scientific hypothesis and a statistical hypothesis
    • Surface area to volume ratio: First we must agree on notation.
    • Is a math idea that holds for an arbitrary f(x) also always true for a population with density N(t) at time t?
  – And grow into major obstacles
    • E.g.: If time is not spent exploring what a biologist means by a population density, ecological models may become impossible to interpret biologically.
    • Statistical results are useless if based on invalid assumptions (e.g., populations of same species may differ quantitatively)

Further Insights
• Computing and Computational Science have emerged as major components
  – Informatics, genetics, proteomics, …
  – And Even in Ecology!
• Programming in R
  – Need is for math/stat informed algorithms
  – Not for elaborate structures or sophisticated programming languages

Further Insights
• Logistics are a challenge
  – Transcripts are important!!!
  – Course sizes / delivery methods differ significantly
    • Biology lectures can be huge
    • Biology labs are typically smaller than math/stat sections
    • (I had never had to consider how to combine a lab grade with a lecture grade)
• Communication is very important, especially about the “little issues” that tend to grow

Future Directions for Symbiosis
• An “Integrated Courses” model
  – Separate Math/Biology courses
    • Better for transcript
    • Allows familiar examination techniques
  – Common Curriculum
    • Same materials as 5+1 courses
    • Biology section maintains the lab component
• This is a re-constitution of Symbiosis, not a replacement for it!!!
  – i.e., a better (logistically, in particular) approach to what we are doing now

Future Directions for Symbiosis
• More emphasis on computation
  – Algorithms as method to address biological inquiries
  – Algorithms as statistical tools
    • Inference via bootstrapping
    • Predictions via clustering
    • Informatics
    • Avoiding reliance on “off-the-shelf” approaches
• Symbiosis IV: A Gen Ed “Intro to Computational Science” course for math and bio majors

Thank you! Any questions?
