_________________________________________________________________

Measurement

Copyright © 1996 by H. Goldstein, J.L. Gross, R.E. Pollack and R.B. Blumberg

(This is the third chapter of the first volume of The Scientific Experience, by Herbert Goldstein, Jonathan L. Gross, Robert E. Pollack and Roger B. Blumberg. The Scientific Experience is a textbook originally written for the Columbia course "Theory and Practice of Science". The primary author of "Measurement" is Jonathan L. Gross, and it has been edited and prepared for the Web by Blumberg. It appears at MendelWeb, for non-commercial educational use only. Although you are welcome to download this text, please do not reproduce it without the permission of the authors.)

3.0: Introduction
3.1: Nominal, ordinal, and interval scales
3.2: What observations can you trust?
3.3: From observation to prediction: the role of models
3.4: Obstructions to measurement
3.5: Exercises
3.6: Notes

_________________________________________________________________

MEASUREMENT

Deciding by observation the amount of a given property a subject possesses is called measurement. In a more liberal sense, the term "measurement" often refers to any instance of systematic observation in a scientific context. Measurement is always an empirical procedure, such as reckoning the mass of an object by weighing it, or evaluating the amount a student has learned by giving an examination. By way of contrast, quantification is a kind of theorizing, such as refining the concept of mass to explain observed resistance to force.

Therefore, it might seem that quantification precedes measurement, or perhaps that after some preliminary observations to develop a method of quantification, one thereafter makes measurements according to that quantification. Such a picture omits much of the hard work typically involved in the creative process. In original scientific investigations, the relationship between quantification and measurement is a "feedback loop". That is, the first set of measurements might suggest that the initial quantification was overly simplistic or even partially wrong, in which case the quantification is appropriately modified. Then some more measurements are made. Perhaps they indicate that not all the problems have been resolved, and that the quantification should be further refined. Then there are more measurements. And so on.

The pot of gold at the end of the quantification rainbow is called a "mathematical model". In general, a model is a construction that represents an observed subject or imitates a system. Museums of science often display physical models of atoms and molecules, for example, or of the solar system. Here is the sense in which a mathematical abstraction can be a model: the relationships among the values of the mathematical variables in the abstraction can imitate the relationships among their respective counterparts in the system being modeled. For instance, there are models of the solar system that enable us to predict solar eclipses long before their occurrence. Such models might be considered valid only to the extent that they predict accurately, regardless of how much attention is given to detail. Analogously, a valid model of some aspect of the economy would be one that accurately predicts future economic behavior.

This chapter describes the steps along the way from empirical observations to the formulation of models. Developing a model is what wins a Nobel Prize.
Perfecting the methods of observation is a step along the way, but "scientific truth" is an empirical demonstration that a model is valid.

3.1 Nominal, ordinal, and interval scales

We have defined measurement in a sufficiently broad sense that it applies to any procedure that assigns a classification or a value to observed phenomena. Informally, the data that result from a measurement procedure are also called measurements. Collecting data is not simply a matter of writing down whatever you see; that would be an infinite task whose outcome would be a few critical observations, completely hidden in an undifferentiated mess of irrelevant details. The collection of data requires structure, including an experimental design and a method of observation. Part of an experimental design is the creation of a "scale" for the measurements, based on the quantification of the phenomena to be observed. There are three major classes of scales, called "nominal" scales, "ordinal" scales, and "interval" scales.

A nominal scale is a qualitative categorization according to unordered distinctions. Consider, for example, an attempt to assign the value "male" or "female" when measuring gender. We might say that female is "category 0" and male "category 1", but we might equally well reverse those numeric labels, because they have nothing to do with femaleness or maleness. Another example of a nominal scale is the department an undergraduate chooses for "major" emphasis. Some academic institutions, such as the Massachusetts Institute of Technology, have assigned a number to every department. When asked to identify his or her major, an M.I.T. student is likely to respond something like "Course 8" (which means physics) or "Course 21" (which means humanities).[1] However, the numeric designations are entirely arbitrary. Thus, even at M.I.T., classification of students according to major department is nominal!

An ordinal scale makes ranked distinctions. For instance, lexicographic ("dictionary") order is an example of an ordinal scale. It depends on only one property, the sequence of letters in the word. The lexicographic order of words, such as "huge" < "infinite" < "little" < "medium-sized" < "tiny", need not be consistent with any notion of order that is derived from the meanings of the words. Military rank is another example of an ordinal scale.

An interval scale is based on the real numbers, so that each unit on the scale expresses the same degree of difference, no matter where on the scale it is located. For instance, weight, length, and duration of time are interval scales. The atomic number of the elements of chemistry is regarded as an interval scale, even though the realizable values are all integers; the difference it expresses between elements one apart is always one proton in the nucleus. The point is that the type of a scale -- nominal, ordinal, or interval -- depends on the underlying nature of the quantification, rather than on the observed existence of particular values, or even on the physical possibility of existence.

If the value of zero on an interval scale represents a total absence of the property being measured, then the scale is sometimes called a ratio scale. For instance, whereas Celsius temperature and Kelvin temperature are both interval scales, Kelvin temperature is a ratio scale but Celsius temperature is not. The difference, in classical physics, is that zero on the Kelvin scale means absolute zero, the case in which all motion stops.
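To make the distinction concrete, here is a minimal sketch (in Python, with two illustrative Celsius readings; the helper function is ours) of why ratios are meaningful on the Kelvin scale but not on the Celsius scale, while differences behave the same on both:

# Celsius is an interval scale (its zero point is arbitrary); Kelvin is a
# ratio scale (its zero means a total absence of the property measured).

def celsius_to_kelvin(c):
    """Shift the zero point; the size of a degree is the same on both scales."""
    return c + 273.15

cool_c, warm_c = 10.0, 20.0              # two illustrative Celsius readings
cool_k = celsius_to_kelvin(cool_c)       # 283.15 K
warm_k = celsius_to_kelvin(warm_c)       # 293.15 K

print(warm_c / cool_c)                   # 2.0, but the warmer day is not "twice as warm"
print(warm_k / cool_k)                   # about 1.035, the physically meaningful ratio
print(warm_c - cool_c)                   # 10.0
print(round(warm_k - cool_k, 2))         # 10.0 -- differences agree on both scales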
Here are some multiple-choice questions designed to illustrate what is missing when an interval scale fails to be a ratio scale, and to show that the concepts involved are somewhat subtle.

Question 1: Suppose it is 10° Celsius on Sunday and 20° on Monday. Does that make it twice as warm on Monday? Choose only one of the following answers.

( ) yes     ( ) no

The sensible answer to Question 1 is no, of course. It is a mistake in physics to identify the concept of "warmth" with measurements on the Celsius scale. Suppose the temperature drops to 1° C on Tuesday. Was it really twenty times as warm on Monday? What if it drops to 0.1° C on Wednesday? Was it ten times as warm on Tuesday, and 200 times as warm on Monday? Perhaps by Thursday the temperature drops to -5°. Was it -4 times as warm on Monday as on Thursday?

The Fahrenheit scale is another interval scale for temperature that is not a ratio scale. The comparable Fahrenheit readings for Sunday, Monday, Wednesday, and Thursday would be 50° F, 68° F, about 32° F, and 23° F. Their ratios make no more sense than ratios on the Celsius scale. The point is that, if a scale is not a ratio scale, then inferences based on the calculation of ratios of its data points may be utterly misleading.

Question 2: Using the Celsius temperature measurements given in Question 1 and its answer, is it correct to say that the temperature difference between Monday (20° C) and Wednesday (0.1° C) was about twice as great as the temperature difference between Monday and Sunday (10° C)?

( ) yes     ( ) no

This time the answer is yes. Even though the Celsius scale is not a ratio scale, when one uses it to measure temperature change, it becomes a ratio scale. When you are measuring relative difference, the value of zero represents no difference; zero is then no longer an arbitrary reference point in the scale, but one with physical and psychological meaning. In particular, the amount of energy required to raise the temperature of, say, one cubic centimeter of water by ten degrees is twice as great as the amount needed to raise it by five degrees. Moreover, although finger-dipping estimates are not as precise or reliable as thermometer readings, people can scale gross differences of temperature within a suitable range of human perception.

The concepts of nominal, ordinal, and interval scale apply to what is possibly only a single dimension of a multivariate observation. Multidimensional measurements and classifications are common. For instance, we might classify individuals according to age and sex, thereby combining an interval scale on one dimension with a nominal scale on the other. In Chapter 7 we shall discuss some instances in which two or more ordinal scales are combined to form a multidimensional scale; the ordinality might partially break down, because subject A might rate higher than subject B on scale 1, while subject B rates higher than subject A on scale 2.

3.2 What observations can you trust?

Certain attributes are thought desirable in just about any method of observation. First of all, a method is said to be reliable if its repetition under comparable circumstances yields the same result. In general, unreliable observation methods are not to be trusted in the practice of science, and theories are not accepted on the basis of irreproducible experiments. Reliability is consistency, and it is perhaps the measurement quality most responsible for yielding scientific knowledge that is "public". In the natural sciences, the criterion of reproducibility is frequently easy to meet.
For instance, two samples of the same chemical substance are likely to be essentially identical, if they are reasonably free of impurities. By way of contrast, this standard often introduces some problems of experimental design into the medical sciences. You cannot expect to apply the same treatment many times to the same patient, for a host of reasons. Among them, the patient's welfare is at stake, and each treatment might change the patient's health. Nonetheless, in medicine, even when you cannot hope to reuse the exact same subject, there is generally some hope of finding comparable subjects.[2]

In some of the social sciences, the problem of achieving reproducibility often seems extreme. The first visit of an anthropologist to a community may well change the community. Similarly, widespread publication of the results of a political poll might induce a change of voter sentiment. Nonetheless, for the findings to be regarded as scientific, whoever publishes them must accept the burden of presenting them in a manner that permits others to test them. The intent to be accurate or impartial is not what makes a discipline a science. Conscientious reporting or disinterested sampling is not the basis for scientific acceptance. Reproducibility is what ultimately counts.

Outstanding results are often obtained by persons with a stake in a particular outcome. For instance, a manufacturer of pharmaceuticals might have a large financial stake in the outcome of research into the usefulness and safety of a particular drug. Academic scientists usually have a greater professional benefit to be gained from "positive" results than from "negative" results; it would usually be considered more noteworthy to obtain evidence for the existence of some particle with specified properties than to gather evidence that it does not exist.

An accurate measurement method is one that gives the measurer the correct value, or a value very close to it. An unreliable method cannot possibly be accurate, but consistency does not guarantee accuracy. An incorrectly calibrated thermometer might yield highly consistent readings on an interval scale, all of them inaccurate. By way of analogy, a piece of artillery that consistently overshoots by the same amount is reliable, but inaccurate. Accuracy means being on target. As a practical matter, reliably inaccurate measurements are often preferable to unreliably inaccurate ones, because you might be able to recalibrate the measuring tools accurately. An extreme analogy is that a person who always says "No" for "Yes", and vice versa, gives you far more information, once you understand the pathology, than a person who lies to deceive, or than one who gives random answers.

Measurements on an ordinal scale are regarded as reliable if they are consistent with regard to rank. For instance, if we decide that when two soldiers meet, the one who salutes first is the one of higher rank, then our method is highly reliable, because there is a carefully observed rule about who salutes first. Our decision method happens to be completely inaccurate, because it is the person of lower rank who must salute first.

If the scale is nominal, we would appraise the reliability of a classificatory method by the likelihood that the subjects of the experiment would be reclassified into the same categories if the method were reapplied. Consider now the reliability of a classification of living species into either the plant kingdom or the animal kingdom according to whether they were observed to be stationary or mobile.
This is not an accurate method, since coral animals are apparently stationary, and sagebrush is seemingly mobile. Moreover, not all living species belong to either the plant kingdom or the animal kingdom, so the problem of discernment is not even well-posed. A further complication is that some species, such as butterflies, have a stationary phase and a mobile phase. However, the issue is only whether whatever method of observation we apply to distinguish between stationariness and mobility produces the same answer repeatedly for members of the same species.

We can rate the reliability of a measurement method according to a worst-case or to an average-case criterion. For interval scale measurements, the reliability is most commonly appraised according to relative absence of discrepancy, and it is given the special name precision. A discrepancy of a millimeter is highly precise if the target is a light-year away. It would be overwhelmingly imprecise if we were measuring the size of a molecule. A common measure of precision is the number of significant digits in a decimal representation of the measurement.

Associated with accuracy is the concept of validity. We say that a measurement is valid if it measures what it purports to measure. A method that is direct and accurate, such as measuring distance by a correctly calibrated ruler, is always valid. However, when the measurement is indirect, it might be invalid, no matter how consistent or precise it is. For example, suppose we attempted to measure how much of a long passage of text an individual had memorized by the number of complete sentences of that passage that the person could write in ten minutes. Although this measure might be consistent, it is invalid in design, since it gives too much weight to handwriting speed. It is also an invalid measure of handwriting speed, because it gives too much weight to memorization.

The phlogiston theory is another example of precise invalidity. Before it was understood that combustion is rapid oxidation, it was thought that when a material ignites, it loses its "phlogiston". The amount of phlogiston lost was reckoned to be the difference in weight between the initial substance and its ash residue. Do you know what accounts for the lost weight?

The question of validity is often enshrouded in semantics. For instance, there is a purported method of measuring "intelligence" according to the frequency with which a person uses esoteric words. The burden of proof that what is measured by such a test, or by any other test, is correlated with other kinds of performance commonly associated with the word "intelligence" lies on the designer of that measure. In the absence of proof, using the word "intelligence" for that property is purely semantic inflation of a single trait of relatively minor importance into a widely admired general kind of superiority. Surely the importance of the concept of intelligence depends on its definition as something more than a property measured by such a simplistic test. Indeed, it might be preferable not to try to reduce a complex property such as intelligence to a single number. Suppose that someone was far above average in verbal skills and far below average in mathematical skills: would it be reasonable to blur the distinction between that person and someone else who was average in both kinds of skills?

Criticism of proposed models -- sometimes based on empirical results, sometimes based on intuition or "thought experiments" -- is an important part of scientific activity.
Of course, every practicing scientist knows that it is far easier to find flaws or unsupported parts in someone else's theory or experiment than to design a useful theory or experiment of one's own.

3.3 From observation to prediction: the role of models

A contribution to scientific knowledge is commonly judged by the same standard as a contribution in many other practical endeavors: how well does it allow us to predict future events or performance? By way of analogy, an investment counselor's financial advice is worthwhile to the extent that it yields future income, not by how well it would have worked in past markets. Other criteria might be applied in astronomy or geology, for instance, but prediction is the standard criterion.[3]

Investment counselors have an incentive to conceal part of the basis for their advice. After all, if you knew everything about how the advice was generated, you could avoid paying the counselor's fee, because you could get answers to your specific questions yourself. In contrast, one of the standard requirements for a contribution to scientific knowledge is that it must include details that permit others to predict future behavior for themselves.[4] The form of a scientific prediction is a mathematical model. Here are some examples.

1. A stone is dropped from a window of every eighth floor of a tall building, that is, from the 8th floor, from the 16th, from the 24th, and so on. As a stone is dropped from each floor, a stopwatch is used to measure the amount of time that elapses until it hits the ground. The following table gives the height of the window sill at each floor and the recorded times in the experiment:

Floor:                      8      16     24     32     40
Distance to Ground (ft.):   96     192    288    384    480
Drop time (sec.):           2.45   3.46   4.24   4.90   5.48

A good model for the relationship between dropping time t and the distance d is the mathematical function:

d = 16t²

This rule certainly explains the recorded observations. You can easily verify that 16 x 2.45² = 96.04, and so on. We attribute the discrepancy of .04 between the distance of 96.04 that our model associates with the dropping time of 2.45 seconds and the observed distance of 96 to measurement error. All of the other discrepancies are similarly small. For the time being, we will not be concerned with the process by which such incredibly precise measurements were made.

2. By supposing that the dropping time is related to the distance, we are avoiding what is often the hardest part of modeling: separating what is relevant from what is irrelevant. Suppose we had carefully recorded the length of the name of the person who dropped the stone from each floor. Here is another table, showing this previously overlooked relationship:

Name of Dropper:     YY     Sal    John   David   Kathy
Name Length:         2      3      4      5       5
Drop time (sec.):    2.45   3.46   4.24   4.90    5.48

If you round the dropping time to the nearest integer, the result is the same as the name length of the person who dropped the stone. In this case the function:

namelength = Round-to-Nearest-Integer (dropping time)

is a mathematical model for the relationship.

How do you decide which model is better? All other things being equal, you would probably prefer the first model for its advantage in precision. However, there is a better way to make the choice, which is to extend the experiment in such a way that the predictions of the two models disagree. For instance, we might send a volunteer named Mike to the 60th floor, whose window is 720 feet from the ground.
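As a concrete check, here is a minimal sketch (in Python, with the data transcribed from the tables above) that compares the observed dropping times with those implied by d = 16t², and then computes what the first model predicts for Mike's stone on the 60th floor:

import math

# Data transcribed from the tables above.
distances_ft = [96, 192, 288, 384, 480]         # height of each window sill
drop_times_s = [2.45, 3.46, 4.24, 4.90, 5.48]   # stopwatch readings

# Model 1: d = 16 t^2, so the predicted dropping time is t = sqrt(d / 16).
for d, t_obs in zip(distances_ft, drop_times_s):
    t_pred = math.sqrt(d / 16)
    print(f"{d:3d} ft: observed {t_obs:.2f} s, model predicts {t_pred:.2f} s")

# Prediction for the 60th floor, whose window sill is 720 feet from the ground.
print("Model 1 predicts", round(math.sqrt(720 / 16), 1), "seconds")   # about 6.7
# Model 2 says namelength = round(dropping time); "Mike" has 4 letters, so it
# predicts a dropping time that rounds to 4, i.e. between 3.5 and 4.5 seconds.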
The first model predicts that the stone will drop in the square root of (720/16) seconds, which is about 6.7 seconds. The second model predicts a dropping time of 3.5 to 4.5 seconds.

Sometimes, two different models make the same prediction for all of the possible cases of immediate interest. In that instance, the philosophical principle called "Ockham's razor" is applied, and the simpler model is regarded as preferable.[5] The namelength model might seem a trifle too frivolous to have been seriously proposed; more frivolous, say, than the discredited phlogiston theory. However, we can turn it around to make a point. Often the best explanation of a phenomenon lies initially in something the experimenter ignores because it is not the right kind of answer. A chemist who is pouring solutions from test tube to test tube might not think that the effect being observed depends on whether the experiment is performed in natural light or artificial light. A physician whose clinical research is conducted primarily on the same day every week might be completely unaware that the patients at the clinic on that day of the week differ from those who are there on other days.

When research findings are published, there is an opportunity for other persons to repeat the experiment. If different results are obtained on different occasions, an explanation is required. It is likely that other researchers will think of various ways to extend the experiment. Often they wish not only to test the results of another scientist, but also to have an opportunity to augment the model.

Not every mathematical model is given by a closed formula. Here is another example.

3. A pair of cuddly animals is kept together from birth. After one month, there is still only the original pair. However, after two months there is a second pair, after three a third. After four months there are five pairs. The following table shows the number of pairs for the first eight months:

Time (months):      0   1   2   3   4   5   6    7    8
Number of Pairs:    1   1   2   3   5   8   13   21   34

Notice that after two months, the number of pairs in any given month is the sum of the numbers of pairs one month ago and two months ago. For instance, the number of pairs at six months is 13, which is the sum of 8 and 5, the numbers at five months and at four months. One possible model for this process is recursive. Let p(n) represent the number of pairs at n months. Then the value of p(n) for any positive integer n is given by this recursively defined function:

p(0) = 1
p(1) = 1
p(n) = p(n-1) + p(n-2)    if n > 1

The following program calculates p(60), or rather it calculates an approximation to p(60) (the number p(60) is so large that it is represented in the computer in floating point form, with lower order digits truncated).

Program Fibonacci[6]

100 DIM P[60]
110 LET P[0] = 1
120 LET P[1] = 1
130 FOR N = 2 TO 60
140 LET P[N] = P[N-1] + P[N-2]
150 NEXT N
200 PRINT P[60]

Despite the recursive definition of the Fibonacci function, using a recursive program to calculate Fibonacci numbers is a bad mistake. Here is such a recursive program:

Function FIBONACCI (N)
    IF N = 0 THEN RETURN 1
    IF N = 1 THEN RETURN 1
    RETURN FIBONACCI (N-1) + FIBONACCI (N-2)

Imagine that you want to calculate the 60th Fibonacci number. The first recursive step,

FIBONACCI (60) = FIBONACCI (59) + FIBONACCI (58)

"calls itself" twice. That is, its right side requires that the function FIBONACCI be calculated for the arguments 59 and 58. Calculating FIBONACCI (59) and FIBONACCI (58) each requires two more self-calls, thus an additional four calls. For each of those four calls, there would be two more self-calls. For each of those eight self-calls, there would be two more -- that is, 16 additional calls -- and so on, until the bottom level was reached. Thus, there would be exponentially many calls. Such a phenomenon is known as "exponential recursive descent". You would have to wait a very long time to see the answer.
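Before moving on, here is a minimal sketch (in Python, chosen for convenience; the helper names are ours) that makes the contrast concrete: it counts the calls made by the naive recursion and compares that with the single bottom-up pass used by Program Fibonacci. Even p(30) costs the recursion millions of calls; p(60) would cost several trillion.

def fib_recursive(n, counter):
    """Naive recursion, mirroring Function FIBONACCI; counter[0] tallies the calls."""
    counter[0] += 1
    if n < 2:
        return 1
    return fib_recursive(n - 1, counter) + fib_recursive(n - 2, counter)

def fib_iterative(n):
    """The strategy of Program Fibonacci: fill in the table p(0), ..., p(n) once."""
    p = [1, 1]
    for i in range(2, n + 1):
        p.append(p[i - 1] + p[i - 2])
    return p[n]

calls = [0]
print(fib_recursive(30, calls), "after", calls[0], "calls")   # 1346269 after 2692537 calls
print(fib_iterative(30), "after only 29 additions")
print(fib_iterative(60))   # Python integers are exact, so p(60) is not truncated here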
3.4 Obstructions to measurement

Different kinds of obstruction to measurement are encountered in different categories of scientific research. In the physical sciences, perhaps the toughest problem is indirectness. For instance, how do you measure the duration of existence of a particle theoretically supposed to survive for a millionth of a second, assuming it really exists at all?

In the biological sciences, limitations on precision are also a formidable obstacle. If you have ever observed a human birth, you know how difficult it would be to assign a precise moment of birth. Of course, most human births are not the subject of scientific research; still, it is interesting to know how the time of birth is decided upon in many cases. Shortly after the infant receives the necessary immediate care, someone says, "Did anyone get the time of birth?", and someone else says, "Yes, it was xx:xx.", based on a rough estimate. While it might be possible in a research study to get somewhat more accurate times than those typically recorded, it would make no sense to try to record them to, say, the tenth of a second.

In this section, we will concern ourselves mainly with the frontiers of measurement in the social sciences. In addition to severe problems of semantics, competing special interests, and deliberate political obfuscation, there are questions of intrinsic measurability -- that is, whether a putative property can be measured at all.

Tax reform is a good example of a truly tangled issue. How do you measure the economic effect that an adopted reform has had on various classes of individuals? An obvious, oversimplified approach is to compute the tax burden for prototype individuals under both the new system and the old. The omitted complication is that when tax laws change, many individuals and businesses adapt their economic behavior. These primary adaptations have consequences for other taxpayers, who also adapt their behavior, in turn affecting other persons. Moreover, the act of changing tax laws can cause alterations in social values, such as risk-taking behavior, charitable giving, and perception of personal needs. These changes in values can, in turn, have a major effect upon taxation under the reformed system. Even the possibility that major changes in tax laws could occur is likely to make some persons and businesses seek near-term results, instead of looking to the long run. Even after tax laws are changed, it is difficult to distinguish whether subsequent changes in the economy are the result of the tax changes or of something else. There are times when we just don't have a trustworthy way to make a measurement, often because so many factors affect the observable outcome that it is hard to sort out what is due to a particular change.

For another preliminary example, consider two persons who are largely in control of what they do at their jobs. Perhaps they are both primarily responsible for administrative work. They are arguing one day about who is "busier" than the other.
One says her calendar is packed from morning to night with appointments to keep and things that must be done, and that since there are no breaks in it whatsoever, no one else could possibly be busier without working an even longer day. The other counters that her day is so much busier: she is constantly having extemporaneous meetings and taking phone calls, and she doesn't even have the time for "non-productive" activities like making schedules to protect herself from interruptions. Underlying this slightly frivolous dispute are two competing theories of measurement of busyness. The first administrator is suggesting that if someone's time is completely allocated in advance, then that person is 100% busy. The second is suggesting that busyness is related to the number (and also, perhaps, to the intensity) of interruptions. As in the case of tax reform, we have no resolution to propose.

The Alabama paradox

A less complicated measurement quandary, in the sense that there are fewer factors involved, is caused by the problem of apportionment in the House of Representatives of the United States. To give a simplified illustration of a phenomenon known as the "Alabama paradox", we will reduce it to a model with four states, known as A, B, C, and D, a population of 100 million persons, and 30 representatives to be apportioned. The following table gives the population of each state, and its exact proportion of the population, expressed as a decimal fraction to three places:

State       Population     Proportion
A            4,500,000        .045
B           10,700,000        .107
C           31,100,000        .311
D           53,700,000        .537
Totals:    100,000,000       1.000

Since there are 30 representatives, the first step in calculating the fair share for each state is to multiply its proportion of the total population by 30. The next table shows the result of this calculation:

State      Proportion     Exact Fair Share
A             .045              1.35
B             .107              3.21
C             .311              9.33
D             .537             16.11
Totals:      1.000             30.00

It is obvious that State A should get at least 1 representative, State B at least 3, State C at least 9, and State D at least 16, but that makes only 29 representatives, so there is one left over. What once seemed obvious to all was that State A should get the remaining representative, because its remaining fraction of 0.35 is the largest fraction among the four states. If there were two remaining representatives after each state got its whole-number share, then the state with the second largest fractional part -- in this case, State C, with 0.33 -- would get the second remaining representative, and so on.

Now imagine that the number of representatives is increased from 30 to 31. You might guess that the additional representative will go to State C, but what happens is far more surprising. Here is a new table of exact fair shares, based on 31 representatives:

State      Proportion     Exact Fair Share
A             .045              1.39
B             .107              3.32
C             .311              9.64
D             .537             16.65
Totals:      1.000             31.00

The integer parts of the fair share calculations remain the same as they were in the 30-representatives case, and they add to a total of 29. However, the two remaining representatives go to State D (with the highest fractional part, at 0.65) and to State C (with the second highest fractional part, at 0.64). Thus, increasing the number of representatives has had the "paradoxical" effect of costing State A one of its representatives.
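The arithmetic behind the paradox is easy to reproduce. Here is a minimal sketch (in Python, using the four hypothetical states above; the function name and structure are ours) of the largest-remainder rule just described, printing the apportionments for 30 and for 31 representatives:

def apportion(populations, seats):
    """Give each state the whole-number part of its exact fair share, then hand
    out the leftover seats in order of largest fractional part."""
    total = sum(populations.values())
    shares = {s: seats * pop / total for s, pop in populations.items()}
    result = {s: int(share) for s, share in shares.items()}
    leftovers = seats - sum(result.values())
    by_fraction = sorted(populations, key=lambda s: shares[s] - int(shares[s]), reverse=True)
    for s in by_fraction[:leftovers]:
        result[s] += 1
    return result

states = {"A": 4_500_000, "B": 10_700_000, "C": 31_100_000, "D": 53_700_000}

print(apportion(states, 30))   # {'A': 2, 'B': 3, 'C': 9, 'D': 16}
print(apportion(states, 31))   # {'A': 1, 'B': 3, 'C': 10, 'D': 17} -- State A loses a seat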
When this first occurred in United States political history, the "victim state" was Alabama, hence the name "Alabama paradox". Later, Maine was a losing state. In the 1970's, two mathematicians, M. Balinski and H. P. Young, devised a different way to achieve "proportional representation", in which each state would still get at least its whole-number share, and such that no state would ever lose a representative to the Alabama paradox. The principal disadvantage of this innovative plan is that scarcely any non-mathematicians are able to understand how to use it. Certain countries, notably France, have resolved the problem by assigning fractional votes to certain representatives. It has also been argued that the Alabama paradox is not unfair, since it merely removes a temporary, partially unearned seat from one state and transfers it to another. In practice, however, it is clear that some states have ganged up on others and contrived to increase the number of representatives in such a manner as to deprive particular persons of their seats.

The Condorcet paradox

Another issue that has arisen in the political realm presents a deeper problem. Suppose that there are several candidates for the same office, and that each voter ranks them in the order of his or her preference. It seems to stand to reason that if one candidate is a winner, then no other candidate would be preferred by a majority of the voters. Now suppose there were three candidates, known as A, B, and C, a population of 100 voters, and that the following distribution of rank orders was obtained:

Ranking Order     Number of Voters
A > B > C                32
B > C > A                33
C > A > B                35
Total:                  100

From this table, it seems to follow that:

1. Candidate C cannot be the winner, since 65% of the voters prefer Candidate B to Candidate C.
2. Candidate B cannot be the winner, since 67% of the voters prefer Candidate A to Candidate B.
3. Candidate A cannot be the winner, since 68% of the voters prefer Candidate C to Candidate A.

This paradox is not an indictment of the concept of rank-order voting. Under most common vote-for-one-only systems, a winner would emerge, but not one with a "clear mandate". For instance, if the winner were determined by plurality on the first ballot, then Candidate C would win, despite the fact that 65% really would have preferred Candidate B. Just because the vote-for-one-only system conceals this preference doesn't change the actuality that it is indeed a preference of the voters. If the rules called for elimination of a least-favored candidate on each round until one candidate obtained a majority, then Candidate A would be eliminated on the first round, and Candidate B would get the 32 votes no longer assignable to Candidate A. Thus, Candidate B would win, even though 67% of the voters would have preferred Candidate A.

The underlying measurement problem here is as follows: given a rank-order preference list supplied by each voter, how do you determine the "collective choice"? The area of applied mathematics in which this problem is studied is called "social choice theory", and one of its practitioners, Kenneth Arrow, won a Nobel Prize in Economics for showing that no method of choosing a winner could simultaneously satisfy all the reasonable constraints one would wish to impose and still work in every instance. Perhaps the strongest aspect of vote-for-one-only systems is that they are easily understood and difficult to manipulate by voter-bloc discipline.
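The tallies claimed above are easy to verify. Here is a minimal sketch (in Python, with the ballot counts taken from the table; the variable names are ours) that computes the pairwise majorities, the first-ballot plurality winner, and the winner under the elimination rule:

from itertools import permutations

# Ballot profile from the table above; with 100 voters, counts are percentages.
ballots = {("A", "B", "C"): 32, ("B", "C", "A"): 33, ("C", "A", "B"): 35}
candidates = ["A", "B", "C"]

# Pairwise majorities: how many voters rank x ahead of y?
for x, y in permutations(candidates, 2):
    ahead = sum(n for order, n in ballots.items() if order.index(x) < order.index(y))
    print(f"{ahead}% prefer {x} to {y}")

# First-ballot plurality: count first choices only.
firsts = {c: sum(n for order, n in ballots.items() if order[0] == c) for c in candidates}
print("Plurality winner:", max(firsts, key=firsts.get), firsts)

# Elimination rule: drop the candidate with the fewest first choices and
# transfer those ballots to each voter's next surviving choice.
loser = min(firsts, key=firsts.get)                      # Candidate A
recount = {c: sum(n for order, n in ballots.items()
                  if next(x for x in order if x != loser) == c)
           for c in candidates if c != loser}
print("After eliminating", loser, "the winner is", max(recount, key=recount.get), recount)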
Under political conditions in which voters commonly feel that the problem is choosing the "least undesirable candidate", since no desirable candidate exists, it is sometimes suggested that voters should have the option of voting "no" against their least-liked candidate, where a "yes" counts +1 and a "no" counts -1. There is both a "yes-or-no" system, in which one may vote either "yes" for a most-favored candidate or "no" against a least-liked candidate, but not both, and a "yes-and-no" system, in which one may cast both a "yes" vote and a "no" vote. A prominent advantage of such a system is that "protest-voters" would not end up "wasting" their votes on obscure candidates with no chance of winning. Nonetheless, "no"-vote systems have problems of their own, and some of these are explored in the exercises below.

3.5 Exercises

1. There are many different reasons why a prize might be awarded to a baseball team for performance in league competition. Identify the type of scale to which each of the following possible criteria belongs, and explain your answer.

a) number of games won against other teams in the league
b) zodiac sign of the shortstop
c) distance between the catcher's eyes
d) hair color of the centerfielder

2. Jones has an expensive foreign-made analog watch with a jeweled movement that gains at most 12 seconds a month. Every month, Jones resets his watch according to the time announced by the telephone company time service. Smith has an inexpensive digital watch that gains at most 2 seconds a year. However, Smith sets her watch five minutes fast, in order to avoid missing commuter trains. Whose watch is more precise? Whose watch is more accurate?

3. Consider a two-dimensional square of definite size. Among all possible transformations that can be made on this square, there are some that will leave it in a position indistinguishable from its original position (e.g., rotation in the plane by 90 degrees, rotation by 180 degrees about a diagonal of the square, or reflection in a two-sided mirror positioned perpendicular to the square and passing through its middle). We call such transformations the symmetry transformations of the square, and say that the more symmetry transformations a system has, the higher its degree of symmetry. What sort of scale is being used in this sort of measurement? Under what conditions would it be sensible to compare the degrees of symmetry of two systems?

4. A conversation is taking place between two members of your college class, and they are talking about their academic standing when they were in high school. One student tells the other that she was 2nd in her graduating class, and the other student remarks, "Oh, well, you're about three times as smart as I am; I was only 6th in my class." Explain what's wrong with the second student's reasoning. Would the second student be warranted in asserting instead that the first student had done better than he had in high school?

5. Write a recursive program for Fibonacci numbers. Use your program to calculate the 5th, 10th, 15th, 20th, 25th, and 30th Fibonacci numbers. Use a wrist watch to record the time it takes for each calculation.

6. The text of this chapter would seem to imply that in deciding whether or not a particular discipline is or is not scientific, reproducibility of measurements is what ultimately counts. Do you think this is a reasonable way to demarcate science from "non-science", and/or can you think of other measurement criteria which you think are more important than "reliability" in establishing knowledge claims? Do you think that the criterion of reliability rules out the possibility of scientific status for some disciplines in principle (e.g. economics, psychoanalysis, literary theory)? Please explain your answers carefully.
7. For the given four-state example, show that an increase from 30 representatives to 32 would not have caused an instance of the Alabama paradox, but that an increase from 30 to 34 would.

8. Either construct a two-state example of the Alabama paradox, or prove that it is impossible to do so.

9. Construct a three-state example of the Alabama paradox.

10. A "yes-and-no" voting system can be viewed as a reduction of a preference ranking system, in which all that matters is which candidate each voter likes best and which each likes least. If voters favoring a second-most popular candidate cast their "no"-votes against the most popular candidate, instead of against a candidate they actually like least, this can sometimes tilt the net tally so that the second-most popular candidate wins. In other words, a "yes-and-no" system is open to deliberate manipulation. Construct a 4-candidate example in which this phenomenon could occur.

3.6 Notes

[1] At M.I.T., literature, art, music, foreign languages, and history are collectively a single department, while there is finer differentiation among the sciences and the engineering disciplines.

[2] Deciding which subjects or groups of subjects are comparable raises a number of difficult questions, and this issue is often left to experts. This obviously presents problems if these same experts have a stake in the outcome of a particular research program.

[3] One of these other criteria, of course, would be explanation. As you read more about models, you might ask yourself how or whether successful models actually explain, or merely describe, the phenomena being modeled. Do you think that explanation, as opposed to description, is very important in science, or is description + prediction the only thing we should be concerned about? Similarly, you might consider whether you think the predictive capability of certain physical theories is the defining characteristic of natural science; that is, consider whether or not you would classify a discipline as "pseudo-science" if the theories of that discipline were not predictive.

[4] This is not to say that scientists always comply with this requirement. Sometimes scientists temporarily conceal a few crucial details for a number of months, in the interest of keeping a lead over other researchers, and thereby advancing their own careers. Like investment counselors, some scientists have been known to protect what they perceive as their own self-interest.

[5] Named for the 14th-century English philosopher William of Ockham, this is the maxim that the number of assumptions introduced to explain something, or the number of entities postulated by a theory, must not be multiplied beyond necessity. Of course, what constitutes "necessity" is often far from clear, and the criterion of "simplicity" in the evaluation of models is notoriously difficult to specify.

[6] You may recognize the name as that of the sequence of numbers in which each number is the sum of the two previous numbers. Named for the 13th-century Italian mathematician Leonardo Fibonacci ("Leonardo, son of Bonaccio"), these numbers have fascinated mathematicians and scientists alike. In this case, the program Fibonacci instructs the computer to simulate the growth of the variable p(n).

_________________________________________________________________
This document can be found at MendelWeb (http://www.mendelweb.org)