_________________________________________________________________

Measurement

Copyright © 1996 by H. Goldstein, J.L. Gross, R.E. Pollack and R.B. Blumberg

(This is the third chapter of the first volume of The Scientific Experience, by Herbert Goldstein, Jonathan L. Gross, Robert E. Pollack and Roger B. Blumberg. The Scientific Experience is a textbook originally written for the Columbia course "Theory and Practice of Science". The primary author of "Measurement" is Jonathan L. Gross, and it has been edited and prepared for the Web by Blumberg. It appears at MendelWeb, for non-commercial educational use only. Although you are welcome to download this text, please do not reproduce it without the permission of the authors.)

3.0: Introduction
3.1: Nominal, ordinal, and interval scales
3.2: What observations can you trust?
3.3: From observation to prediction: the role of models
3.4: Obstructions to measurement
3.5: Exercises
3.6: Notes

_________________________________________________________________

MEASUREMENT

Deciding by observation the amount of a given property a subject possesses is called measurement. In a more liberal sense, the term "measurement" often refers to any instance of systematic observation in a scientific context. Measurement is always an empirical procedure, such as reckoning the mass of an object by weighing it, or evaluating the amount a student has learned by giving an examination. By way of contrast, quantification is a kind of theorizing, such as refining the concept of mass to explain observed resistance to force.

Therefore, it might seem that quantification precedes measurement, or perhaps that after some preliminary observations to develop a method of quantification, one thereafter makes measurements according to that quantification. Such a picture omits much of the hard work typically involved in the creative process. In original scientific investigations, the relationship between quantification and measurement is a "feedback loop". That is, the first set of measurements might suggest that the initial quantification was overly simplistic or even partially wrong, in which case the quantification is appropriately modified. Then some more measurements are made. Perhaps they indicate that not all the problems have been resolved, and that the quantification should be further refined. Then there are more measurements. And so on.

The pot of gold at the end of the quantification rainbow is called a "mathematical model". In general, a model is a construction that represents an observed subject or imitates a system. Museums of science often display physical models of atoms and molecules, for example, or of the solar system. Here is the sense in which a mathematical abstraction can be a model: the relationships among the values of the mathematical variables in the abstraction can imitate the relationships among their respective counterparts in the system being modeled. For instance, there are models of the solar system that enable us to predict solar eclipses long before their occurrence. Such models might be considered valid only to the extent that they predict accurately, regardless of how much attention is given to detail. Analogously, a valid model of some aspect of the economy would be one that accurately predicts future economic behavior.

This chapter describes the steps along the way from empirical observations to the formulation of models. Developing a model is what wins a Nobel Prize.
Perfecting the methods of observation is a step along the way, but "scientific truth" is an empirical demonstration that a model is valid.

3.1 Nominal, ordinal, and interval scales

We have defined measurement in a sufficiently broad sense that it applies to any procedure that assigns a classification or a value to observed phenomena. Informally, the data that result from a measurement procedure are also called measurements. Collecting data is not simply a matter of writing down whatever you see; that would be an infinite task whose outcome would be a few critical observations, completely hidden in an undifferentiated mess of irrelevant details. The collection of data requires structure, including an experimental design and a method of observation. Part of an experimental design is the creation of a "scale" for the measurements, based on the quantification of the phenomena to be observed. There are three major classes of scales, called "nominal" scales, "ordinal" scales, and "interval" scales.

A nominal scale is a qualitative categorization according to unordered distinctions. Consider, for example, an attempt to assign the value "male" or "female" when measuring gender. We might say that female is "category 0" and male "category 1", but we might equally well reverse those numeric labels, because they have nothing to do with femaleness or maleness. Another example of a nominal scale is the department an undergraduate chooses for "major" emphasis. Some academic institutions, such as the Massachusetts Institute of Technology, have assigned a number to every department. When asked to identify his or her major, an M.I.T. student is likely to respond something like "Course 8" (which means physics) or "Course 21" (which means humanities).[1] However, the numeric designations are entirely arbitrary. Thus, even at M.I.T., classification of students according to major department is nominal!

An ordinal scale makes ranked distinctions. For instance, lexicographic ("dictionary") order is an example of an ordinal scale. It depends on only one property, the sequence of letters in the word. The lexicographic order of words, such as "huge" < "infinite" < "little" < "medium-sized" < "tiny", need not be consistent with any notion of order that is derived from the meanings of the words. Military rank is another example of an ordinal scale.

An interval scale is based on the real numbers, so that each unit on the scale expresses the same degree of difference, no matter where on the scale it is located. For instance, weight, length, and duration of time are interval scales. The atomic number of the elements of chemistry is regarded as an interval scale, even though the realizable values are all integers; the difference it expresses between elements one apart is always one proton in the nucleus. The point is that the type of a scale -- nominal, ordinal, or interval -- depends on the underlying nature of the quantification, rather than on the observed existence of particular values, or even on the physical possibility of existence.

If the value of zero on an interval scale represents a total absence of the property being measured, then the scale is sometimes called a ratio scale. For instance, whereas Celsius temperature and Kelvin temperature are both interval scales, Kelvin temperature is a ratio scale but Celsius temperature is not. The difference, in classical physics, is that zero on the Kelvin scale means absolute zero, the case in which all motion stops.
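To make the distinction concrete, here is a minimal sketch (in Python, with two illustrative Celsius readings; the helper function is ours) of why ratios are meaningful on the Kelvin scale but not on the Celsius scale, while differences behave the same on both:

# Celsius is an interval scale (its zero point is arbitrary); Kelvin is a
# ratio scale (its zero means a total absence of the property measured).

def celsius_to_kelvin(c):
    """Shift the zero point; the size of a degree is the same on both scales."""
    return c + 273.15

cool_c, warm_c = 10.0, 20.0              # two illustrative Celsius readings
cool_k = celsius_to_kelvin(cool_c)       # 283.15 K
warm_k = celsius_to_kelvin(warm_c)       # 293.15 K

print(warm_c / cool_c)                   # 2.0, but the warmer day is not "twice as warm"
print(warm_k / cool_k)                   # about 1.035, the physically meaningful ratio
print(warm_c - cool_c)                   # 10.0
print(round(warm_k - cool_k, 2))         # 10.0 -- differences agree on both scales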
Here are some multiple-choice questions designed to illustrate what is missing when an interval scale fails to be a ratio scale, and to show that the concepts involved are somewhat subtle.

Question 1: Suppose it is 10° Celsius on Sunday and 20° on Monday. Does that make it twice as warm on Monday? Choose only one of the following answers.

( ) yes     ( ) no

The sensible answer to Question 1 is no, of course. It is a mistake in physics to identify the concept of "warmth" with measurements on the Celsius scale. Suppose the temperature drops to 1° C on Tuesday. Was it really twenty times as warm on Monday? What if it drops to 0.1° C on Wednesday? Was it ten times as warm on Tuesday, and 200 times as warm on Monday? Perhaps by Thursday the temperature drops to -5°. Was it -4 times as warm on Monday as on Thursday?

The Fahrenheit scale is another interval scale for temperature that is not a ratio scale. The comparable Fahrenheit readings for Sunday, Monday, Wednesday, and Thursday would be 50° F, 68° F, about 32° F, and 23° F. Their ratios make no more sense than ratios on the Celsius scale. The point is that, if a scale is not a ratio scale, then inferences based on the calculation of ratios of its data points may be utterly misleading.

Question 2: Using the Celsius temperature measurements given in Question 1 and its answer, is it correct to say that the temperature difference between Monday (20° C) and Wednesday (0.1° C) was about twice as great as the temperature difference between Monday and Sunday (10° C)?

( ) yes     ( ) no

This time the answer is yes. Even though the Celsius scale is not a ratio scale, when one uses it to measure temperature change, it becomes a ratio scale. When you are measuring relative difference, the value of zero represents no difference; zero is then no longer an arbitrary reference point in the scale, but one with physical and psychological meaning. In particular, the amount of energy required to raise the temperature of, say, one cubic centimeter of water by ten degrees is twice as great as the amount needed to raise it by five degrees. Moreover, although finger-dipping estimates are not as precise or reliable as thermometer readings, people can scale gross differences of temperature within a suitable range of human perception.

The concepts of nominal, ordinal, and interval scale apply to what is possibly only a single dimension of a multivariate observation. Multidimensional measurements and classifications are common. For instance, we might classify individuals according to age and sex, thereby combining an interval scale on one dimension with a nominal scale on the other. In Chapter 7 we shall discuss some instances in which two or more ordinal scales are combined to form a multidimensional scale; the ordinality might partially break down, because subject A might rate higher than subject B on scale 1, while subject B rates higher than subject A on scale 2.

3.2 What observations can you trust?

Certain attributes are thought desirable in just about any method of observation. First of all, a method is said to be reliable if its repetition under comparable circumstances yields the same result. In general, unreliable observation methods are not to be trusted in the practice of science, and theories are not accepted on the basis of irreproducible experiments. Reliability is consistency, and it is perhaps the measurement quality most responsible for yielding scientific knowledge that is "public". In the natural sciences, the criterion of reproducibility is frequently easy to meet.
For instance, two samples of the same chemical substance are likely to be essentially identical, if they are reasonably free of impurities. By way of contrast, this standard often introduces some problems of experimental design into the medical sciences. You cannot expect to apply the same treatment many times to the same patient, for a host of reasons. Among them, the patient's welfare is at stake, and each treatment might change the patient's health. Nonetheless, in medicine, even when you cannot hope to reuse the exact same subject, there is generally some hope of finding comparable subjects.[2]

In some of the social sciences, the problem of achieving reproducibility often seems extreme. The first visit of an anthropologist to a community may well change the community. Similarly, widespread publication of the results of a political poll might induce a change of voter sentiment. Nonetheless, for the findings to be regarded as scientific, whoever publishes them must accept the burden of presenting them in a manner that permits others to test them. The intent to be accurate or impartial is not what makes a discipline a science. Conscientious reporting or disinterested sampling is not the basis for scientific acceptance. Reproducibility is what ultimately counts.

Outstanding results are often obtained by persons with a stake in a particular outcome. For instance, a manufacturer of pharmaceuticals might have a large financial stake in the outcome of research into the usefulness and safety of a particular drug. Academic scientists usually have a greater professional benefit to be gained from "positive" results than from "negative" results; it would usually be considered more noteworthy to obtain evidence for the existence of some particle with specified properties than to gather evidence that it does not exist.

An accurate measurement method is one that gives the measurer the correct value, or a value very close to it. An unreliable method cannot possibly be accurate, but consistency does not guarantee accuracy. An incorrectly calibrated thermometer might yield highly consistent readings on an interval scale, all of them inaccurate. By way of analogy, a piece of artillery that consistently overshoots by the same amount is reliable, but inaccurate. Accuracy means being on target. As a practical matter, reliably inaccurate measurements are often preferable to unreliably inaccurate ones, because you might be able to recalibrate the measuring tools accurately. An extreme analogy is that a person who always says "No" for "Yes", and vice versa, gives you far more information, once you understand the pathology, than a person who lies to deceive, or than one who gives random answers.

Measurements on an ordinal scale are regarded as reliable if they are consistent with regard to rank. For instance, if we decide that when two soldiers meet, the one who salutes first is the one of higher rank, then our method is highly reliable, because there is a carefully observed rule about who salutes first. Our decision method happens to be completely inaccurate, because it is the person of lower rank who must salute first.

If the scale is nominal, we would appraise the reliability of a classificatory method by the likelihood that the subjects of the experiment would be reclassified into the same categories if the method were reapplied. Consider now the reliability of a classification of living species into either the plant kingdom or the animal kingdom according to whether they were observed to be stationary or mobile.
This is not an accurate method, since coral animals are apparently stationary, and sagebrush is seemingly mobile. Moreover, not all living species belong to either the plant kingdom or the animal kingdom, so the problem of discernment is not even well-posed. A further complication is that some species, such as butterflies, have a stationary phase and a mobile phase. However, the issue is only whether whatever method of observation we apply to distinguish between stationariness and mobility produces the same answer repeatedly for members of the same species.

We can rate the reliability of a measurement method according to a worst-case or to an average-case criterion. For interval scale measurements, the reliability is most commonly appraised according to relative absence of discrepancy, and it is given the special name precision. A discrepancy of a millimeter is highly precise if the target is a light-year away. It would be overwhelmingly imprecise if we were measuring the size of a molecule. A common measure of precision is the number of significant digits in a decimal representation of the measurement.

Associated with accuracy is the concept of validity. We say that a measurement is valid if it measures what it purports to measure. A method that is direct and accurate, such as measuring distance by a correctly calibrated ruler, is always valid. However, when the measurement is indirect, it might be invalid, no matter how consistent or precise it is. For example, suppose we attempted to measure how much of a long passage of text an individual had memorized by the number of complete sentences of that passage that the person could write in ten minutes. Although this measure might be consistent, it is invalid in design, since it gives too much weight to handwriting speed. It is also an invalid measure of handwriting speed, because it gives too much weight to memorization.

The phlogiston theory is another example of precise invalidity. Before it was understood that combustion is rapid oxidation, it was thought that when a material ignites, it loses its "phlogiston". The amount of phlogiston lost was reckoned to be the difference in weight between the initial substance and its ash residue. Do you know what accounts for the lost weight?

The question of validity is often enshrouded in semantics. For instance, there is a purported method of measuring "intelligence" according to the frequency with which a person uses esoteric words. The burden of proof that what is measured by such a test, or by any other test, is correlated with other kinds of performance commonly associated with the word "intelligence" lies on the designer of that measure. In the absence of proof, using the word "intelligence" for that property is purely semantic inflation of a single trait of relatively minor importance into a widely admired general kind of superiority. Surely the importance of the concept of intelligence depends on its definition as something more than a property measured by such a simplistic test. Indeed, it might be preferable not to try to reduce a complex property such as intelligence to a single number. Suppose that someone was far above average in verbal skills and far below average in mathematical skills: would it be reasonable to blur the distinction between that person and someone else who was average in both kinds of skills?

Criticism of proposed models -- sometimes based on empirical results, sometimes based on intuition or "thought experiments" -- is an important part of scientific activity.
Of course, every practicing scientist knows that it is far easier to find flaws or unsupported parts in someone else's theory or experiment than to design a useful theory or experiment of one's own.

3.3 From observation to prediction: the role of models

A contribution to scientific knowledge is commonly judged by the same standard as a contribution in many other practical endeavors: how well does it allow us to predict future events or performance? By way of analogy, an investment counselor's financial advice is worthwhile to the extent that it yields future income, not by how well it would have worked in past markets. Other criteria might be applied in astronomy or geology, for instance, but prediction is the standard criterion.[3]

Investment counselors have an incentive to conceal part of the basis for their advice. After all, if you knew everything about how the advice was generated, you could avoid paying the counselor's fee, because you could get answers to your specific questions yourself. In contrast, one of the standard requirements for a contribution to scientific knowledge is that it must include details that permit others to predict future behavior for themselves.[4] The form of a scientific prediction is a mathematical model. Here are some examples.

1. A stone is dropped from a window of every eighth floor of a tall building, that is, from the 8th floor, from the 16th, from the 24th, and so on. As a stone is dropped from each floor, a stopwatch is used to measure the amount of time that elapses until it hits the ground. The following table gives the height of the window sill at each floor and the recorded times in the experiment:

Floor:                      8      16     24     32     40
Distance to Ground (ft.):   96     192    288    384    480
Drop time (sec.):           2.45   3.46   4.24   4.90   5.48

A good model for the relationship between dropping time t and the distance d is the mathematical function:

d = 16t²

This rule certainly explains the recorded observations. You can easily verify that 16 x 2.45² = 96.04, and so on. We attribute the discrepancy of .04 between the distance of 96.04 that our model associates with the dropping time of 2.45 seconds and the observed distance of 96 to measurement error. All of the other discrepancies are similarly small. For the time being, we will not be concerned with the process by which such incredibly precise measurements were made.

2. By supposing that the dropping time is related to the distance, we are avoiding what is often the hardest part of modeling: separating what is relevant from what is irrelevant. Suppose we had carefully recorded the length of the name of the person who dropped the stone from each floor. Here is another table, showing this previously overlooked relationship:

Name of Dropper:     YY     Sal    John   David   Kathy
Name Length:         2      3      4      5       5
Drop time (sec.):    2.45   3.46   4.24   4.90    5.48

If you round the dropping time to the nearest integer, the result is the same as the name length of the person who dropped the stone. In this case the function:

namelength = Round-to-Nearest-Integer (dropping time)

is a mathematical model for the relationship.

How do you decide which model is better? All other things being equal, you would probably prefer the first model for its advantage in precision. However, there is a better way to make the choice, which is to extend the experiment in such a way that the predictions of the two models disagree. For instance, we might send a volunteer named Mike to the 60th floor, whose window is 720 feet from the ground.
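As a concrete check, here is a minimal sketch (in Python, with the data transcribed from the tables above) that compares the observed dropping times with those implied by d = 16t², and then computes what the first model predicts for Mike's stone on the 60th floor:

import math

# Data transcribed from the tables above.
distances_ft = [96, 192, 288, 384, 480]         # height of each window sill
drop_times_s = [2.45, 3.46, 4.24, 4.90, 5.48]   # stopwatch readings

# Model 1: d = 16 t^2, so the predicted dropping time is t = sqrt(d / 16).
for d, t_obs in zip(distances_ft, drop_times_s):
    t_pred = math.sqrt(d / 16)
    print(f"{d:3d} ft: observed {t_obs:.2f} s, model predicts {t_pred:.2f} s")

# Prediction for the 60th floor, whose window sill is 720 feet from the ground.
print("Model 1 predicts", round(math.sqrt(720 / 16), 1), "seconds")   # about 6.7
# Model 2 says namelength = round(dropping time); "Mike" has 4 letters, so it
# predicts a dropping time that rounds to 4, i.e. between 3.5 and 4.5 seconds.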
The first model predicts that the stone will drop in the square root of (720/16) seconds, which is about 6.7 seconds. The second model predicts a dropping time of 3.5 to 4.5 seconds.

Sometimes, two different models make the same prediction for all of the possible cases of immediate interest. In that instance, the philosophical principle called "Ockham's razor" is applied, and the simpler model is regarded as preferable.[5] The namelength model might seem a trifle too frivolous to have been seriously proposed; more frivolous, say, than the discredited phlogiston theory. However, we can turn it around to make a point. Often the best explanation of a phenomenon lies initially in something the experimenter ignores because it is not the right kind of answer. A chemist who is pouring solutions from test tube to test tube might not think that the effect being observed depends on whether the experiment is performed in natural light or artificial light. A physician whose clinical research is conducted primarily on the same day every week might be completely unaware that the patients at the clinic on that day of the week differ from those who are there on other days.

When research findings are published, there is an opportunity for other persons to repeat the experiment. If different results are obtained on different occasions, an explanation is required. It is likely that other researchers will think of various ways to extend the experiment. Often they wish not only to test the results of another scientist, but also to have an opportunity to augment the model.

Not every mathematical model is given by a closed formula. Here is another example.

3. A pair of cuddly animals is kept together from birth. After one month, there is still only the original pair. However, after two months there is a second pair, after three a third. After four months there are five pairs. The following table shows the number of pairs for the first eight months:

Time (months):      0   1   2   3   4   5   6    7    8
Number of Pairs:    1   1   2   3   5   8   13   21   34

Notice that after two months, the number of pairs in any given month is the sum of the numbers of pairs one month ago and two months ago. For instance, the number of pairs at six months is 13, which is the sum of 8 and 5, the numbers at five months and at four months. One possible model for this process is recursive. Let p(n) represent the number of pairs at n months. Then the value of p(n) for any positive integer n is given by this recursively defined function:

p(0) = 1
p(1) = 1
p(n) = p(n-1) + p(n-2)    if n > 1

The following program calculates p(60), or rather it calculates an approximation to p(60) (the number p(60) is so large that it is represented in the computer in floating point form, with lower order digits truncated).

Program Fibonacci[6]

100 DIM P[60]
110 LET P[0] = 1
120 LET P[1] = 1
130 FOR N = 2 TO 60
140 LET P[N] = P[N-1] + P[N-2]
150 NEXT N
200 PRINT P[60]

Despite the recursive definition of the Fibonacci function, using a recursive program to calculate Fibonacci numbers is a bad mistake. Here is such a recursive program:

Function FIBONACCI (N)
    IF N = 0 THEN RETURN 1
    IF N = 1 THEN RETURN 1
    RETURN FIBONACCI (N-1) + FIBONACCI (N-2)

Imagine that you want to calculate the 60th Fibonacci number. The first recursive step,

FIBONACCI (60) = FIBONACCI (59) + FIBONACCI (58)

"calls itself" twice. That is, its right side requires that the function FIBONACCI be calculated for the arguments 59 and 58. Calculating FIBONACCI (59) and FIBONACCI (58) each requires two more self-calls, thus an additional four calls. For each of those four calls, there would be two more self-calls. For each of those eight self-calls, there would be two more -- that is, 16 additional calls -- and so on, until the bottom level was reached. Thus, there would be exponentially many calls. Such a phenomenon is known as "exponential recursive descent". You would have to wait a very long time to see the answer.
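Before moving on, here is a minimal sketch (in Python, chosen for convenience; the helper names are ours) that makes the contrast concrete: it counts the calls made by the naive recursion and compares that with the single bottom-up pass used by Program Fibonacci. Even p(30) costs the recursion millions of calls; p(60) would cost several trillion.

def fib_recursive(n, counter):
    """Naive recursion, mirroring Function FIBONACCI; counter[0] tallies the calls."""
    counter[0] += 1
    if n < 2:
        return 1
    return fib_recursive(n - 1, counter) + fib_recursive(n - 2, counter)

def fib_iterative(n):
    """The strategy of Program Fibonacci: fill in the table p(0), ..., p(n) once."""
    p = [1, 1]
    for i in range(2, n + 1):
        p.append(p[i - 1] + p[i - 2])
    return p[n]

calls = [0]
print(fib_recursive(30, calls), "after", calls[0], "calls")   # 1346269 after 2692537 calls
print(fib_iterative(30), "after only 29 additions")
print(fib_iterative(60))   # Python integers are exact, so p(60) is not truncated here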
3.4 Obstructions to measurement

Different kinds of obstruction to measurement are encountered in different categories of scientific research. In the physical sciences, perhaps the toughest problem is indirectness. For instance, how do you measure the duration of existence of a particle theoretically supposed to survive for a millionth of a second, assuming it really exists at all?

In the biological sciences, limitations on precision are also a formidable obstacle. If you have ever observed a human birth, you know how difficult it would be to assign a precise moment of birth. Of course, most human births are not the subject of scientific research; still, it is interesting to know how the time of birth is decided upon in many cases. Shortly after the infant receives the necessary immediate care, someone says, "Did anyone get the time of birth?", and someone else says, "Yes, it was xx:xx.", based on a rough estimate. While it might be possible in a research study to get somewhat more accurate times than those typically recorded, it would make no sense to try to record them to, say, the tenth of a second.

In this section, we will concern ourselves mainly with the frontiers of measurement in the social sciences. In addition to severe problems of semantics, competing special interests, and deliberate political obfuscation, there are questions of intrinsic measurability -- that is, whether a putative property can be measured at all.

Tax reform is a good example of a truly tangled issue. How do you measure the economic effect that an adopted reform has had on various classes of individuals? An obvious, oversimplified approach is to compute the tax burden for prototype individuals under both the new system and the old. The omitted complication is that when tax laws change, many individuals and businesses adapt their economic behavior. These primary adaptations have consequences for other taxpayers, who also adapt their behavior, in turn affecting other persons. Moreover, the act of changing tax laws can cause alterations in social values, such as risk-taking behavior, charitable giving, and perception of personal needs. These changes in values can, in turn, have a major effect upon taxation under the reformed system. Even the possibility that major changes in tax laws could occur is likely to make some persons and businesses seek near-term results, instead of looking to the long run. Even after tax laws are changed, it is difficult to distinguish whether subsequent changes in the economy are the result of the tax changes or of something else. There are times when we just don't have a trustworthy way to make a measurement, often because so many factors affect the observable outcome that it is hard to sort out what is due to a particular change.

For another preliminary example, consider two persons who are largely in control of what they do at their jobs. Perhaps they are both primarily responsible for administrative work. They are arguing one day about who is "busier" than the other.
One says her calendar is packed from morning to night with appointments to keep and things that must be done, and that since there are no breaks in it whatsoever, no one else could possibly be busier without working an even longer day. The other counters that her day is so much busier: she is constantly having extemporaneous meetings and taking phone calls, and she doesn't even have the time for "non-productive" activities like making schedules to protect herself from interruptions. Underlying this slightly frivolous dispute are two competing theories of measurement of busyness. The first administrator is suggesting that if someone's time is completely allocated in advance, then that person is 100% busy. The second is suggesting that busyness is related to the number (and also, perhaps, to the intensity) of interruptions. As in the case of tax reform, we have no resolution to propose.

The Alabama paradox

A less complicated measurement quandary, in the sense that there are fewer factors involved, is caused by the problem of apportionment in the House of Representatives of the United States. To give a simplified illustration of a phenomenon known as the "Alabama paradox", we will reduce it to a model with four states, known as A, B, C, and D, a population of 100 million persons, and 30 representatives to be apportioned. The following table gives the population of each state, and its exact proportion of the population, expressed as a decimal fraction to three places:

State       Population     Proportion
A            4,500,000        .045
B           10,700,000        .107
C           31,100,000        .311
D           53,700,000        .537
Totals:    100,000,000       1.000

Since there are 30 representatives, the first step in calculating the fair share for each state is to multiply its proportion of the total population by 30. The next table shows the result of this calculation:

State      Proportion     Exact Fair Share
A             .045              1.35
B             .107              3.21
C             .311              9.33
D             .537             16.11
Totals:      1.000             30.00

It is obvious that State A should get at least 1 representative, State B at least 3, State C at least 9, and State D at least 16, but that makes only 29 representatives, so there is one left over. What once seemed obvious to all was that State A should get the remaining representative, because its remaining fraction of 0.35 is the largest fraction among the four states. If there were two remaining representatives after each state got its whole-number share, then the state with the second largest fractional part -- in this case, State C, with 0.33 -- would get the second remaining representative, and so on.

Now imagine that the number of representatives is increased from 30 to 31. You might guess that the additional representative will go to State C, but what happens is far more surprising. Here is a new table of exact fair shares, based on 31 representatives:

State      Proportion     Exact Fair Share
A             .045              1.39
B             .107              3.32
C             .311              9.64
D             .537             16.65
Totals:      1.000             31.00

The integer parts of the fair share calculations remain the same as they were in the 30-representatives case, and they add to a total of 29. However, the two remaining representatives go to State D (with the highest fractional part, at 0.65) and to State C (with the second highest fractional part, at 0.64). Thus, increasing the number of representatives has had the "paradoxical" effect of costing State A one of its representatives.
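The arithmetic behind the paradox is easy to reproduce. Here is a minimal sketch (in Python, using the four hypothetical states above; the function name and structure are ours) of the largest-remainder rule just described, printing the apportionments for 30 and for 31 representatives:

def apportion(populations, seats):
    """Give each state the whole-number part of its exact fair share, then hand
    out the leftover seats in order of largest fractional part."""
    total = sum(populations.values())
    shares = {s: seats * pop / total for s, pop in populations.items()}
    result = {s: int(share) for s, share in shares.items()}
    leftovers = seats - sum(result.values())
    by_fraction = sorted(populations, key=lambda s: shares[s] - int(shares[s]), reverse=True)
    for s in by_fraction[:leftovers]:
        result[s] += 1
    return result

states = {"A": 4_500_000, "B": 10_700_000, "C": 31_100_000, "D": 53_700_000}

print(apportion(states, 30))   # {'A': 2, 'B': 3, 'C': 9, 'D': 16}
print(apportion(states, 31))   # {'A': 1, 'B': 3, 'C': 10, 'D': 17} -- State A loses a seat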
When this first occurred in United States political history, the "victim state" was Alabama, hence the name "Alabama paradox". Later, Maine was a losing state. In the 1970's, two mathematicians, M. Balinski and H. P. Young, devised a different way to achieve "proportional representation", in which each state would still get at least its whole-number share, and such that no state would ever lose a representative to the Alabama paradox. The principal disadvantage of this innovative plan is that scarcely any non-mathematicians are able to understand how to use it. Certain countries, notably France, have resolved the problem by assigning fractional votes to certain representatives. It has also been argued that the Alabama paradox is not unfair, since it merely removes a temporary, partially unearned seat from one state and transfers it to another. In practice, however, it is clear that some states have ganged up on others and contrived to increase the number of representatives in such a manner as to deprive particular persons of their seats.

The Condorcet paradox

Another issue that has arisen in the political realm presents a deeper problem. Suppose that there are several candidates for the same office, and that each voter ranks them in the order of his or her preference. It seems to stand to reason that if one candidate is a winner, then no other candidate would be preferred by a majority of the voters. Now suppose there were three candidates, known as A, B, and C, a population of 100 voters, and that the following distribution of rank orders was obtained:

Ranking Order     Number of Voters
A > B > C                32
B > C > A                33
C > A > B                35
Total:                  100

From this table, it seems to follow that:

1. Candidate C cannot be the winner, since 65% of the voters prefer Candidate B to Candidate C.
2. Candidate B cannot be the winner, since 67% of the voters prefer Candidate A to Candidate B.
3. Candidate A cannot be the winner, since 68% of the voters prefer Candidate C to Candidate A.

This paradox is not an indictment of the concept of rank-order voting. Under most common vote-for-one-only systems, a winner would emerge, but not one with a "clear mandate". For instance, if the winner were determined by plurality on the first ballot, then Candidate C would win, despite the fact that 65% really would have preferred Candidate B. Just because the vote-for-one-only system conceals this preference doesn't change the actuality that it is indeed a preference of the voters. If the rules called for elimination of a least-favored candidate on each round until one candidate obtained a majority, then Candidate A would be eliminated on the first round, and Candidate B would get the 32 votes no longer assignable to Candidate A. Thus, Candidate B would win, even though 67% of the voters would have preferred Candidate A.

The underlying measurement problem here is as follows: given a rank-order preference list supplied by each voter, how do you determine the "collective choice"? The area of applied mathematics in which this problem is studied is called "social choice theory", and one of its practitioners, Kenneth Arrow, won a Nobel Prize in Economics for showing that no method of choosing a winner could simultaneously satisfy all the reasonable constraints one would wish to impose and still work in every instance. Perhaps the strongest aspect of vote-for-one-only systems is that they are easily understood and difficult to manipulate by voter-bloc discipline.
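The tallies claimed above are easy to verify. Here is a minimal sketch (in Python, with the ballot counts taken from the table; the variable names are ours) that computes the pairwise majorities, the first-ballot plurality winner, and the winner under the elimination rule:

from itertools import permutations

# Ballot profile from the table above; with 100 voters, counts are percentages.
ballots = {("A", "B", "C"): 32, ("B", "C", "A"): 33, ("C", "A", "B"): 35}
candidates = ["A", "B", "C"]

# Pairwise majorities: how many voters rank x ahead of y?
for x, y in permutations(candidates, 2):
    ahead = sum(n for order, n in ballots.items() if order.index(x) < order.index(y))
    print(f"{ahead}% prefer {x} to {y}")

# First-ballot plurality: count first choices only.
firsts = {c: sum(n for order, n in ballots.items() if order[0] == c) for c in candidates}
print("Plurality winner:", max(firsts, key=firsts.get), firsts)

# Elimination rule: drop the candidate with the fewest first choices and
# transfer those ballots to each voter's next surviving choice.
loser = min(firsts, key=firsts.get)                      # Candidate A
recount = {c: sum(n for order, n in ballots.items()
                  if next(x for x in order if x != loser) == c)
           for c in candidates if c != loser}
print("After eliminating", loser, "the winner is", max(recount, key=recount.get), recount)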
Under political conditions in which voters commonly feel that the problem is choosing the "least undesirable candidate", since no desirable candidate exists, it is sometimes suggested that voters should have the option of voting "no" against their least-liked candidate, where a "yes" counts +1 and a "no" counts -1. There is both a "yes-or-no" system, in which one may vote either "yes" for a most-favored candidate or "no" against a least-liked candidate, but not both, and a "yes-and-no" system, in which one may cast both a "yes" vote and a "no" vote. A prominent advantage of such a system is that "protest-voters" would not end up "wasting" their votes on obscure candidates with no chance of winning. Nonetheless, "no"-vote systems have problems of their own, and some of these are explored in the exercises below.

3.5 Exercises

1. There are many different reasons why a prize might be awarded to a baseball team for performance in league competition. Identify the type of scale to which each of the following possible criteria belongs, and explain your answer.

a) number of games won against other teams in the league
b) zodiac sign of the shortstop
c) distance between the catcher's eyes
d) hair color of the centerfielder

2. Jones has an expensive foreign-made analog watch with a jeweled movement that gains at most 12 seconds a month. Every month, Jones resets his watch according to the time announced by the telephone company time service. Smith has an inexpensive digital watch that gains at most 2 seconds a year. However, Smith sets her watch five minutes fast, in order to avoid missing commuter trains. Whose watch is more precise? Whose watch is more accurate?

3. Consider a two-dimensional square of definite size. Among all possible transformations that can be made on this square, there are some that will leave it in a position indistinguishable from its original position (e.g., rotation in the plane by 90 degrees, rotation by 180 degrees about a diagonal of the square, or reflection in a two-sided mirror positioned perpendicular to the square and passing through its middle). We call such transformations the symmetry transformations of the square, and say that the more symmetry transformations a system has, the higher its degree of symmetry. What sort of scale is being used in this sort of measurement? Under what conditions would it be sensible to compare the degrees of symmetry of two systems?

4. A conversation is taking place between two members of your college class, and they are talking about their academic standing when they were in high school. One student tells the other that she was 2nd in her graduating class, and the other student remarks, "Oh, well, you're about three times as smart as I am; I was only 6th in my class." Explain what's wrong with the second student's reasoning. Would the second student be warranted in asserting instead that the first student had done better than he had in high school?

5. Write a recursive program for Fibonacci numbers. Use your program to calculate the 5th, 10th, 15th, 20th, 25th, and 30th Fibonacci numbers. Use a wrist watch to record the time it takes for each calculation.

6. The text of this chapter would seem to imply that in deciding whether or not a particular discipline is or is not scientific, reproducibility of measurements is what ultimately counts. Do you think this is a reasonable way to demarcate science from "non-science", and/or can you think of other measurement criteria which you think are more important than "reliability" in establishing knowledge claims? Do you think that the criterion of reliability rules out the possibility of scientific status for some disciplines in principle (e.g. economics, psychoanalysis, literary theory)? Please explain your answers carefully.
7. For the given four-state example, show that an increase from 30 representatives to 32 would not have caused an instance of the Alabama paradox, but that an increase from 30 to 34 would.

8. Either construct a two-state example of the Alabama paradox, or prove that it is impossible to do so.

9. Construct a three-state example of the Alabama paradox.

10. A "yes-and-no" voting system can be viewed as a reduction of a preference ranking system, in which all that matters is which candidate each voter likes best and which each likes least. If voters favoring a second-most popular candidate cast their "no"-votes against the most popular candidate, instead of against a candidate they actually like least, this can sometimes tilt the net tally so that the second-most popular candidate wins. In other words, a "yes-and-no" system is open to deliberate manipulation. Construct a 4-candidate example in which this phenomenon could occur.

3.6 Notes

[1] At M.I.T., literature, art, music, foreign languages, and history are collectively a single department, while there is finer differentiation among the sciences and the engineering disciplines.

[2] Deciding which subjects or groups of subjects are comparable raises a number of difficult questions, and this issue is often left to experts. This obviously presents problems if these same experts have a stake in the outcome of a particular research program.

[3] One of these other criteria, of course, would be explanation. As you read more about models, you might ask yourself how or whether successful models actually explain, or merely describe, the phenomena being modeled. Do you think that explanation, as opposed to description, is very important in science, or is description + prediction the only thing we should be concerned about? Similarly, you might consider whether you think the predictive capability of certain physical theories is the defining characteristic of natural science; that is, consider whether or not you would classify a discipline as "pseudo-science" if the theories of that discipline were not predictive.

[4] This is not to say that scientists always comply with this requirement. Sometimes scientists temporarily conceal a few crucial details for a number of months, in the interest of keeping a lead over other researchers, and thereby advancing their own careers. Like investment counselors, some scientists have been known to protect what they perceive as their own self-interest.

[5] Named for the 14th-century English philosopher William of Ockham, this is the maxim that the number of assumptions introduced to explain something, or the number of entities postulated by a theory, must not be multiplied beyond necessity. Of course, what constitutes "necessity" is often far from clear, and the criterion of "simplicity" in the evaluation of models is notoriously difficult to specify.

[6] You may recognize the name as that of the sequence of numbers in which each number is the sum of the two previous numbers. Named for the 13th-century Italian mathematician Leonardo Fibonacci ("Leonardo, son of Bonaccio"), these numbers have fascinated mathematicians and scientists alike. In this case, the program Fibonacci instructs the computer to simulate the growth of the variable p(n).

_________________________________________________________________
This document can be found at MendelWeb (http://www.mendelweb.org)