Intelligence and Educational Assessment


What you should know

  • Types and limitations of psychometric tests
  • Assessing educational performance at different ages (e.g. Key Stages, the role and types of national examinations)

Reading list

  • Guy R. Lefrançois - Psychology for Teaching (8th Ed) - Chapter 7
  • Steven R. Banks and Charles L. Thompson - Educational Psychology - Chapter 11
  • David Fontana - Psychology for Teachers (3rd Ed) - Chapter 5
  • Dennis Child - Psychology and the teacher (3rd Ed) - Chapter 9

3 basic views of intelligence

  • 1 Psychometric - measurement - what IQ tests measure, general underlying ability (g) problem for racists
  • 2 Piagetian - child/environment interaction
  • 3 Information-processing - cognitive processes (e.g. Sternberg)


Vygotsky (1986) - potential (Hebb calls this intelligence `A').
Current level of intelligence (intelligence `B').
Cattell - fluid abilities - basic - non-verbal - unaffected by experience (more susceptible to old age). Also prospective memory (see Mayner 95).
Crystallised - primarily verbal, influenced by culture and education. Will increase with age (Horn and Donaldson, 1980)
Gardner - 7 unrelated multiple intelligences (see Table 7.1).
Problems

  • 1) some components are difficult to measure.
  • 2) intelligence cannot be separated from culture and education (Gardner and Hatch 1989).


Sternberg's information processing view
Contextual subtheory - intelligence is successful adaptation to the environment. This could be assessed by asking people what counts as intelligent or stupid in their culture. In North America three broad groupings of abilities emerge:

  1. practical problem-solving ability
  2. verbal ability
  3. social competence

Different results might be expected in different cultures.

Advantage: intelligence is now observable or concrete rather than abstract and academic.

Disadvantages:

  1. this view is too inclusive - nearly all behaviours are potentially intelligent because any one behaviour may be useful in at least one context even though generally speaking the behaviour may not be that useful elsewhere.
  2. this view does not describe the processes and structures that help to explain intelligence.

Therefore the three component subtheory is also needed

Three component subtheory

  1. Meta-components - executive processes, selecting cognitive abilities, monitoring them and evaluating their results
  2. Performance components - activities used in carrying out cognitive tasks.
  3. Knowledge-acquisition components - activities involved in acquiring new information.

Intelligence tests correlate well with school achievement.
Conventional measures measure the extent to which the individual has profited from past learning experiences.
Vygotsky (1996) and Feuerstein (1979) - to measure learning potential, subjects must be placed in situations in which they must learn, rather than in situations where past learning is tapped.
Bloom (1964) - IQ scores measured at ages 5 and 17 correlate at 0.80.
Use of computers and calculators increases IQ (Salomon et al, 1991).
IQ tests do not tap important qualities, such as interpersonal skills, creativity, athletic ability.
Many are biased against social and ethnic minorities.
`Culture-reduced' tests - non-verbal, use pictures or abstract designs (e.g. Raven's Progressive Matrices test)
McClelland (1973) argues IQ tests bear little relationship to success in life, but Barrett and Depinet (1991) conclude IQ is positively related to job performance.
IQ score ranges from 50 - 160 (average 100)

Group tests given by a teacher

  • Draw a person - originally developed by Goodenough (1926) and revised by Harris (1963) and Naglieri (1988). Children's drawings are supposed to reflect their conceptual sophistication.
  • CogAT - a paper and pencil test for grades 3 to 13. Three scores are given - verbal, quantitative, and non-verbal, which are combined to give an overall IQ score. A child's performance can be compared with normative data.
  • Otis-Lennon School Ability Test - this test is suitable for grades 1 to 12. It yields just one score, known as the "school ability index" (SAI), and is made up of a mixture of items (vocabulary, reasoning, numerical etc.) arranged in order of difficulty.

Individual tests

Expensive, but reliable for important decisions. Need an expert to administer these.

  • Peabody Picture Vocabulary Test-R

The child chooses the one picture out of four that matches a word spoken by the experimenter. Items are arranged in order of difficulty and the test is terminated after six consecutive incorrect answers. The child's IQ is calculated by taking into account the level the child achieved in the test as well as his or her age.

  • Revised Stanford-Binet

This test does not use the term IQ; instead the term 'standard age score' (SAS) is used. Items are graded in order of difficulty. Four separate scores are given: verbal reasoning, quantitative reasoning, abstract/visual reasoning, and short-term memory. These scores can be combined to give a measure of "adaptive ability".

    • Verbal reasoning
    • quantitative reasoning [maths]
    • abstract/visual reasoning [block design, copying figures, predicting what a folded design would look like once unfolded]
    • short-term memory (repeating a sentence, reproducing a pattern made of beads, recalling a series of pictures in order)

  • Wechsler scales (WISC-III)

Similar to Stanford-Binet
Adult and pre-school versions exist.
2 basic sections - verbal (reasoning and vocabulary skills) and performance (visual-spatial skills)

Verbal section has 6 subtests

  1. Information test - general knowledge
  2. Similarities test - comparing two items for similarity
  3. Vocabulary test - defining words
  4. Comprehension test - asked about what would be appropriate in a given situation
  5. Arithmetic test - questions presented verbally
  6. Digit span test - repeating back digits

The Performance Section

  7. Picture Completion test - identifying missing part of picture
  8. Picture Arrangement test - placing pictures in order, so as to tell a story
  9. Block design test - arranging cubes with red and white designs, so as to copy given pattern
  10. Object assembly test - constructing an object from pieces that need to be joined in a fixed order
  11. Coding test - copying non-verbal symbols in order
  12. Mazes test - tracing a maze

See Lefrançois, Table 7.2, p190

SOMPA (System of Multicultural Pluralistic Assessment).

Assesses biological and social normality and derives an estimated learning potential (ELP) score, based on WISC-III scores and standardised on ethnic minority samples. Takes into account important family variables (e.g. size, income, structure, socio-economic status).
Sattler (1982) criticises SOMPA: the Californian sample is not representative, SOMPA predictions are no more valid than those from the WISC-III alone, and it is not wise to use a medical model for educational decisions. However, it is good for detecting gifted African-American children who are not detected by other tests (Matthew et al 1992).

Factors that affect manifested intelligence.

  • Family size and birth order
  • Ethnic background
  • Social class

Rubber-band hypothesis

We are all born with different sized rubber bands (potential intelligence). These bands can be stretched. Large bands can be stretched further than small bands, but small stretched bands are longer than unstretched `big' bands.
First-borns and only children have higher intelligence, and academic performance.
Intellectual climate of home is a function of family size and position in the family (Zajonc).

Definition of creativity

Definitions are on p197; they relate to the examples on p198.
Gallagher (1960) - teachers miss 20% of the most highly creative students. School dropout for gifted adolescents is higher than for general population (McMann & Oliver, 1988)
Mistake to think that creativity is to be found only amongst those with the highest IQ. Evidence on pages 199-200.

Measurement of Creativity

Unusual uses test - e.g. brick or nylon stocking. Scored for fluency, flexibility and originality (a response that occurs less than 5% of the time).
High intelligence is important for creativity, but personality and social factors are also important.
Getzels and Jackson (1962) - creative students do not necessarily have the highest IQ, and were not liked by teachers.
High correlation between measured creativity and IQ scores (McCleod & Cropley, 1989)
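
As a rough illustration of how the three scores on an unusual uses test could be tallied, here is a minimal Python sketch. The notes give only the score names and the 5% rarity threshold; the definitions used for fluency (number of responses) and flexibility (number of distinct categories), together with the category labels and sample frequencies, are assumptions made for illustration.

```python
def score_unusual_uses(responses, category_of, sample_frequency, rarity_cutoff=0.05):
    """Return (fluency, flexibility, originality) for one person's responses."""
    # Fluency: how many uses were suggested (assumed definition).
    fluency = len(responses)
    # Flexibility: how many different kinds of use were suggested (assumed definition).
    flexibility = len({category_of[r] for r in responses})
    # Originality: responses given by fewer than 5% of the comparison sample.
    originality = sum(1 for r in responses if sample_frequency[r] < rarity_cutoff)
    return fluency, flexibility, originality


# Hypothetical data for the "brick" item.
category_of = {
    "build a wall": "construction",
    "doorstop": "weight",
    "paperweight": "weight",
    "grind into pigment": "art",
}
sample_frequency = {
    "build a wall": 0.60,
    "doorstop": 0.20,
    "paperweight": 0.10,
    "grind into pigment": 0.01,
}
responses = ["build a wall", "doorstop", "paperweight", "grind into pigment"]

scores = score_unusual_uses(responses, category_of, sample_frequency)
print(scores)
```

Only the last response falls under the 5% cutoff, so it is the only one counted as original.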

Guilford's model of intelligence.

Guilford (1959) see fig 7.7 p202.
120 distinct human abilities
Allows for creativity and intelligence in one model.

Implications for teachers

  • 1 Complexity of intellectual processes.
  • 2 Variety of forms in which ability can be expressed.
  • 3 Importance of instructional process. Need greater emphasis on creative thinking, evaluation, implications, etc. Programs designed to foster learning/thinking strategies.

Divergent and convergent thinking.

Divergent is generating several ideas from a given problem.
Convergent is deriving one solution from a given set of facts.
Divergent thinking is synonymous with creative thinking.

Validity

  • Face - appears to measure what it is supposed to.
  • Content - is it measuring what is being taught?
  • Construct - hypothetical variables, also measured by other tests (e.g. is extroversion a meaningful concept?)
  • Criterion-related
    • Concurrent - agrees with other tests
    • Predictive

Reliability

- affected by improvement (with age) and by chance (especially with multiple-choice) - best to make tests longer or to use many shorter ones.

  • repeated measures
  • parallel forms - 2 versions of same test
  • split-half reliability
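
Split-half reliability can be sketched in a few lines of Python. This is an illustrative sketch only: the odd/even split, the made-up item scores and the Spearman-Brown step-up (which estimates full-length reliability from the half-test correlation) are assumptions not stated in these notes.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def split_half_reliability(item_scores):
    """Correlate odd-item and even-item totals, then apply Spearman-Brown."""
    odd_totals = [sum(row[0::2]) for row in item_scores]   # items 1, 3, 5, ...
    even_totals = [sum(row[1::2]) for row in item_scores]  # items 2, 4, 6, ...
    r_half = pearson_r(odd_totals, even_totals)
    r_full = 2 * r_half / (1 + r_half)  # Spearman-Brown correction (assumed step)
    return r_half, r_full


# Hypothetical data: 5 students x 6 items, 1 = correct, 0 = incorrect.
item_scores = [
    [1, 1, 1, 1, 1, 0],
    [1, 0, 1, 1, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
]

r_half, r_full = split_half_reliability(item_scores)
print(round(r_half, 2), round(r_full, 2))
```

The corrected figure is higher than the half-test correlation, which fits the point above that longer tests are more reliable.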

Maguire (1992) - teachers often just teach students to pass a test.
Wolf et al - current school tests assess the skill of detecting and selecting answers rather than generating them. They are memory based, rather than designed to promote thinking.

Standardized tests

- students' results compared with norms.

Use

  • Placement
  • certify achievement
  • Judge teachers
  • evaluate schools
  • Instructional diagnoses


In America there was an anti-testing movement in the 50s and 60s because tests were thought to be unfair.
But the report `A Nation at Risk' (1983) persuaded teachers to use tests again.
Nolen and Haas (1991) - raising educational standards is equated with raising test scores.
Teachers are embarrassed by tests, so they teach children to pass tests, which invalidates tests.
Teacher-made tests - essays - maths tests - used to grade or see whether ready for next module.

Types and limitations of psychometric tests

Teachers set their own tests because the tests can cover the material that they have taught. Packages may be too broad.

Evaluation should motivate students rather than demotivate them. Tests provide feedback to the students, telling them what needs to be improved and what parts of the curriculum have been mastered.

Tests are also used to make schools more accountable. In America many schools are being too generous with allocating grades (known as 'grade inflation'). The same standardised tests, used by many schools, should guard against grade inflation.

Are intelligence tests biased?

Some critics believe standard IQ tests are biased against certain
personality types. For example, Queens College mathematician, and author of
"The tyranny of testing", Banesh Hoffman wrote in 1962 that standardised
tests disadvantage "intellectually honest candidates with subtle, probing
critical or creative minds" - an enduring criticism that refuses to die
away. In fact, only a few years ago, an academic, Robert Reich, who had
previously served as the US politician responsible for employment,
criticised standardised tests because of their inability to measure
creativity, an attribute he considered vital to many current jobs.

Donald Powers (Educational Testing Service, Princeton, USA) and James
Kaufman (University of California, USA) investigated the relationship
between the Graduate Record Examination (GRE) scores of 342 students and
their conscientiousness, rationality, ingenuity, quickness, creativity and
depth, as measured by self-report personality questionnaires. The GRE is an
IQ-type test used to select candidates for postgraduate study in America.

Overall, the researchers found no substantive evidence to support the
criticisms made by Hoffman and others that IQ-type tests are biased against
creative types. Any links between intelligence scores and personality were
modest and, in fact, relative to the low creativity scorers, there was a
tendency for the students with higher creativity scores to perform better
on the analytical, quantitative and verbal measures of the Graduate Record
Examination.
_______________________________________

Powers, D.E. & Kaufman, J.C. (2004). Do standardised tests penalise deep-thinking, creative, or conscientious students? Some personality correlates of Graduate Record Examination test scores. Intelligence, 32, 145-153.

Journal weblink: http://www.sciencedirect.com/science/journal/01602896

Robert Reich's article in Education Week (free registration required):
http://www.edweek.com/ew/ewstory.cfm?slug=41reich.h20&keywords=reich

Purchase "The Tyranny of testing":
http://www.amazon.co.uk/exec/obidos/ASIN/0313200971/qid=1079705589/sr=1-1/ref=sr_1_0_1/026-5190474-4787621

Graduate Record Examinations: http://www.gre.org/

Formative Evaluations

These are tests and essays given at various times throughout a course, in order to find out what needs to be improved. It may be the student who needs improving, or perhaps the teaching! It is best to use criterion-referenced evaluation (Gronlund & Linn, 1990).

Summative Evaluations

End of course or module test. Used to grade a student. Best to use norm-referenced evaluation (Gronlund & Linn, 1990).

Objective Tests

Fig 13.4 -

  • completion
  • matching
  • True-False
  • Multiple Choice.

Essay versus Objective tests

  • 1 Essays make it easier to tap higher-level processes - organisation, inferences etc.
  • 2 Essay content is limited - essays can't test a wide variety of topics.
  • 3 Essays make divergence possible.
  • 4 Essays are easy for the teacher to construct.
  • 5 Scoring essays takes longer. Objective tests can use computer technology.
  • 6 Essay scores are unreliable (Educational Testing Service 1961).


300 essays were rated by 53 judges on a 9-point scale:
one-third received all possible grades
37% received 8 different grades
23% received 7 different grades
Some markers gave moderate marks; others gave extremes.
Knowledge of the student affects scores.
Halo effect - the first few good answers affect how the rest of the essay is marked.

Suggestions for constructing tests

Essays

Questions should be specific for easy scoring.
Restricted response easier to score (as opposed to open-ended or extended-response). For example: In two paragraphs or less, list two similarities and two differences, etc.
Allow sufficient time for students to use high-level processes (i.e. planning).
Weighting specified.
Wording should make clear the teacher's expectations.
Scoring - outline model answers for one answer before going on to the next.
Aim to be objective.
Specify the number of points available for each part of essay, eg content, organization, application, synthesis of ideas.

Multiple-choice items

Stems

  • Stems should be complete; not half of a sentence which is completed by one of the alternatives.
  • Make stems clear and concise
  • Stems should be longer than the alternatives
  • Questions should be positive (eg 'Which one was....') rather than negative (eg 'Which one was not...').

Alternatives

  • should be grammatically consistent with stem
  • each should be a plausible answer
  • students, when guessing, tend to choose either the first or the longest alternative.

Problems

  • Unable to measure problem-solving ability
  • Unable to measure critical thinking
  • Does not measure the student's ability to develop and express an idea
  • Does not measure the application of knowledge to new situations

Norms and normal distribution

Students should make sure that they understand the

  • 'Normal Distribution Curve'
  • Mean and Standard deviation
  • percentiles - eg 75th percentile is the score where 75% of subjects fall on or below it
  • Z-scores - equivalent to standard deviations, with the mean being equal to a Z-score of 0
  • T-scores - a T score of 50 is the mean and a T-score interval of 10 is equal to one standard deviation (eg a T-score of 60 is 1 standard deviation above the mean)
  • Stanines - a score of 1 is 2 standard deviations below the mean, 3 is 1 standard deviation below, 5 is the mean, and so forth up to a maximum of 9 (2 standard deviations above the mean)

(see pages 372-3 Lefrançois, pages 345-7 Banks & Thompson)
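
The conversions listed above can be tried out with a short Python sketch. The class marks are hypothetical, and the stanine rounding rule (each band half a standard deviation wide, centred on 2z + 5) is an assumption chosen to agree with the description above.

```python
import math
from statistics import mean, pstdev


def z_score(raw, group_mean, group_sd):
    """Distance from the mean in standard deviation units."""
    return (raw - group_mean) / group_sd


def t_score(z):
    """T-score: mean of 50, standard deviation of 10."""
    return 50 + 10 * z


def stanine(z):
    """Stanine from 1 to 9; 5 is the mean (rounding rule is an assumption)."""
    return max(1, min(9, math.floor(2 * z + 5.5)))


def percentile_rank(raw, scores):
    """Percentage of scores falling on or below the given raw score."""
    at_or_below = sum(1 for s in scores if s <= raw)
    return 100 * at_or_below / len(scores)


scores = [40, 50, 60, 70, 80]  # hypothetical class marks
group_mean, group_sd = mean(scores), pstdev(scores)

for raw in scores:
    z = z_score(raw, group_mean, group_sd)
    print(raw, round(z, 2), round(t_score(z), 1), stanine(z), percentile_rank(raw, scores))
```

A mark equal to the class mean comes out as a Z-score of 0, a T-score of 50 and a stanine of 5.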

Reporting Test Results

Central tendency - mean
median
mode (not really useful!)

The Standard Deviation calculation is illustrated in Table 13.4
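
For readers without the table to hand, the following Python sketch works through the three measures of central tendency and a standard deviation on a small made-up set of marks. It divides by N (the population formula); this is an assumption, and Table 13.4 may use N - 1 instead.

```python
from math import sqrt
from statistics import mean, median, mode

marks = [2, 4, 4, 5, 7, 8]  # hypothetical marks

average = mean(marks)       # arithmetic mean
middle = median(marks)      # middle value (average of the two middle marks here)
most_common = mode(marks)   # most frequently occurring mark

# Standard deviation step by step: deviations from the mean, squared,
# averaged, then square-rooted.
squared_deviations = [(m - average) ** 2 for m in marks]
variance = sum(squared_deviations) / len(marks)
standard_deviation = sqrt(variance)

print(average, middle, most_common, standard_deviation)
```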

Criterion-referenced testing

Anecdote (story) about having to leave the lowlands before dark or else be eaten. If you are last, as long as you are high enough before dark you are just as well off as the person who was first.
Norm-referenced - compare to others
therefore a student can be seen as good in a class of low ability
or a student can be seen as bad in a class of high ability
Criterion-referenced - pass a criterion (as in the above anecdote).
Choice depends upon what is being tested.
Easy to set criteria for typing, less so for social studies. Criterion-referenced - basic skills; norm-referenced - higher-level skills (Hopkins, Stanley and Hopkins 1990).
Criterion referencing - no student need consistently fail. This can lead to grade inflation. Suitable cut-off points could be derived from the norm-referenced results of the previous year's classes. Exclusive reliance on criterion referencing may thwart students' initiative.
Norm-referenced - better for predicting academic success, but decreases cooperative learning and interaction.
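
The contrast between the two approaches can be made concrete with a small Python sketch. The pass mark and both sets of class marks are hypothetical; the point is only that the same mark gives the same criterion-referenced result in any class, while its norm-referenced standing depends on who else took the test.

```python
from statistics import mean, pstdev


def criterion_referenced(score, pass_mark):
    """Pass or fail against a fixed criterion, regardless of other students."""
    return score >= pass_mark


def norm_referenced(score, class_scores):
    """Standing relative to the class, in standard deviations from the class mean."""
    return (score - mean(class_scores)) / pstdev(class_scores)


student_mark = 60
pass_mark = 50                          # hypothetical criterion
low_ability_class = [30, 40, 50, 60]    # hypothetical marks
high_ability_class = [60, 70, 80, 90]   # hypothetical marks

passed = criterion_referenced(student_mark, pass_mark)
z_in_low_class = norm_referenced(student_mark, low_ability_class)
z_in_high_class = norm_referenced(student_mark, high_ability_class)

print(passed, round(z_in_low_class, 2), round(z_in_high_class, 2))
```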

Cureton's (1971) recommended cut-off points for norm referenced data

Grade          Standard deviations from mean    Percentage of students achieving grade
A              1.5 above                        7
B              0.5 to 1.5 above                 24
C              0.5 below to 0.5 above           38
D              1.5 below to 0.5 below           24
F (Failure)    1.5 below                        7
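
Cureton's cut-off points can be applied mechanically once a mark has been converted to a Z-score. The Python sketch below is illustrative: how marks falling exactly on a boundary are graded is an assumption, and the percentages are recomputed from the standard normal curve as a check on the figures in the table.

```python
from statistics import NormalDist


def cureton_grade(z):
    """Grade for a mark expressed in standard deviations from the mean."""
    if z >= 1.5:
        return "A"
    if z >= 0.5:
        return "B"
    if z >= -0.5:
        return "C"
    if z >= -1.5:
        return "D"
    return "F"


def expected_percentages():
    """Share of a normal distribution falling in each grade band, to the nearest 1%."""
    cdf = NormalDist().cdf  # standard normal: mean 0, standard deviation 1
    return {
        "A": round(100 * (1 - cdf(1.5))),
        "B": round(100 * (cdf(1.5) - cdf(0.5))),
        "C": round(100 * (cdf(0.5) - cdf(-0.5))),
        "D": round(100 * (cdf(-0.5) - cdf(-1.5))),
        "F": round(100 * cdf(-1.5)),
    }


print(cureton_grade(1.6), cureton_grade(0.0), cureton_grade(-2.0))
print(expected_percentages())
```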

Ethics of testing

Personality tests can invade privacy when they probe into matters that would not ordinarily be publicly revealed.
Tests are threatening when placement, job opportunities, success and failure depend upon their results.
Tests can be unjust when used with groups for whom they were not designed.
In America, parents have right of access to school files. When adolescents reach 18 they can have access themselves.
Current testing practices - G. Grant (1991): "test behaviours that are easy to measure, encourage individual accomplishment and competitiveness rather than group performance".

The American Scholastic Aptitude Test (SAT)

Dates back to 1926. Used as an entrance exam for college. Taken by 1 million students in 1991 (Dodge 1991). Two subtests - Verbal and Mathematical

Test bias in the SAT

Average SAT scores in 1991 (Dodge 1991)

                     Verbal    Maths
African Americans    351       385
Whites               441       489

The difference in maths scores between the two groups is unlikely to be cultural because maths is relatively independent of cultural bias. The difference is better explained as reflecting a socioeconomic bias: many African Americans fall within the lower socioeconomic ranges. Should something be done about this? Can anything be done?

Can students be prepared for tests?

Messick (1982) - substantial improvements can be made.
Cunningham (1986) - modest improvements for short-term courses. Intensive training may produce greater gains. Maths scores can be improved more than verbal scores.
Such courses are expensive. This accentuates the pre-existing socioeconomic bias.

The decline in SAT scores

Year    Verbal    Maths
1963    478       502
1981    424       466
1991    422       474

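Overall decline in the combined (verbal plus maths) score from 1963 to 1991 is about 8.6% (980 to 896); at the 1981 low point it was about 9.2% (980 to 890).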
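
A percentage decline of this kind can be worked out directly from the table. The Python sketch below assumes that the overall figure is based on the combined score (verbal plus maths); the notes do not say how the overall figure was derived, so treat that as an assumption.

```python
# Mean SAT scores from the table above: year -> (verbal, maths)
sat_means = {
    1963: (478, 502),
    1981: (424, 466),
    1991: (422, 474),
}


def percent_decline(before, after):
    """Fall from `before` to `after`, as a percentage of `before`."""
    return 100 * (before - after) / before


def combined(year):
    """Verbal plus maths for a given year (assumed basis for an overall figure)."""
    verbal, maths = sat_means[year]
    return verbal + maths


verbal_decline = percent_decline(sat_means[1963][0], sat_means[1991][0])
maths_decline = percent_decline(sat_means[1963][1], sat_means[1991][1])
overall_decline = percent_decline(combined(1963), combined(1991))

print(round(verbal_decline, 1), round(maths_decline, 1), round(overall_decline, 1))
```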

Possible explanations

  • Massive increase in the college population, bringing in students of broader abilities. However, in recent years there has not been such an increase in students, and the scores still went down!
  • Social causes such as increases in teenage pregnancies and single-parent families. Both parents out to work. Too much TV and not enough reading.
  • Teachers failing to maintain standards.
  • Parents with low expectations who do not motivate their children.

Alternative testing techniques

Developmental assessment - actual accomplishments.
Do not compare with others.
Example of checklist Fig 13.6.

Sampling Performances of thought

Sereda (1992) - the child explains what he is thinking as he attempts to solve a problem.
Points awarded for correct strategies, etc.
(See italics on p379)

Exhibitions

Wolf et al (1991) emphasise the profoundly social nature of thinking.
Perform in front of others.
Oral examinations
Musical recitations, etc.

Portfolios

Collection of any evidence of ability, collected over much of the student's time at school.

For

Emphasis on learning how to learn.
Autonomous, reflective, independent, creative thinking.
Might expose social and intellectual skills.

Against

Cumbersome, time consuming, less exact.
Not easily quantifiable - not suitable for deciding which students get admitted to college, or for deciding who gets a scholarship.
Ewell (1991) Generates volumes of material - but no way to analyse it.

The British National Curriculum and Testing

The national curriculum

GCSE

NVQ

Key Studies

Assessing Educational Performance

Study 1 - What do we stand to learn about an individual whose intelligence is measured by an IQ test?

Type of study - Review article

Alpay (2003)

 

Aim – criticism of IQ tests

1) It is not meaningful to describe intelligence as just one number (IQ score) as this does not take into account the many different types of intelligence. All IQ tests make assumptions about the nature of intelligence and there are ongoing debates about whether intelligence is a single entity or covers a broad spectrum of abilities. There is also controversy concerning the degree to which intelligence is innate. However, it is widely accepted that modern standardised tests do not measure all abilities, such as creativity, practical sense and social sensitivity. The fact that these tests (such as the Wechsler scale) correlate well with school achievement is no surprise, as schooling itself primarily focuses on the logical-mathematical and linguistic abilities emphasised by IQ tests. The widespread use of these tests has created a culture in which other abilities (or intelligences) are not equally valued. The Wechsler scale emphasises mathematics, logic and language, which means other forms of intelligence are devalued.

 Evaluation point 1 – (Validity) Intelligence tests are not a valid means of assessing intelligence.

Other criticisms of the way IQ tests have been used in research are listed below:

Differences in IQ scores between different nationalities have been attributed to differences in intelligence, whereas it is more likely that this represents differences in culture and schooling. Gould is relevant here.

Evaluation point 2 – could be nature-nurture or/and individual differences.

Also, test differences have been attributed to innate factors, ignoring the possibility that they may arise from early environmental differences, as shown by adoption studies. For example, whereas black students tend to score lower than white students, there are no such differences between white students and black students adopted or fostered by white guardians; this implies that ethnic differences in IQ scores are a result of upbringing and not genetics.

Evaluation point 3 – Nature-nurture.

 

No overall gender differences have been found, but the differences between males and females have been attributed to physiological causes instead of environmental ones.

 

Evaluation point 4 – IQ tests lack ecological validity. IQ tests are given outside the context of normal human behaviour, and this may favour certain types of people. Even Raven's Progressive Matrices, which rely on abstract visual reasoning and reduce bias associated with prior knowledge, are not reflective of the experience-based intelligence which is required in everyday life.

 

The concept of an IQ test gives little recognition to Vygotsky's ideas of a zone of proximal development in learning. Furthermore, as Piaget noted, it is not the number of correct answers that matters, but the reasoning behind them, and IQ tests take little account of this.

 

Given all these criticisms, what does an IQ test tell us about an individual? If the abilities for a particular task can be clearly defined, and social, motivational and other personal characteristics are not relevant, then it may be possible to design a test that measures an individual's ability to do the task; however, this is not a situation that is likely to come up very often. Other benefits of IQ testing may be to assess specific aspects of an individual's abilities to help with further development of these abilities. Similarly, where IQ tests give low scores, this may help identify an individual's learning difficulties and any special educational provision that would help them.


Key Study 2

Criterion-referenced assessment as a guide to learning - the importance of progression and reliability

Green (2002)

The author of this study works in the research department of an examination board, and her paper consists of a discussion of criterion-referenced assessment. She starts by saying that the aim of criterion-referencing is to '... focus on individual, differentiated assessment. By moving away from norm-referencing, to a system which describes what students know, understand and can do, assessments can be used to provide feedback and inform future teaching and learning needs'.

In order for criterion-referenced assessment to work, it is necessary to define 'success' at a given level, by describing the types and range of performance that students at that level should be able to demonstrate. Green stresses how important it is that performance scales be age independent, so that they can be used to assess a student's progression, and not just their achievement at one particular moment in time. A major difficulty of assessment occurs when levels of performance require interpretation and human judgement; if assessment is more subjective, then it is less reliable. For 'true' criterion-referencing, we should not accept criteria that allow for a range of interpretations. However, such criteria would be too numerous, narrow and unmanageable. Green argues that in order to reconcile the demands of rigorous assessment and the problems of subjective evaluations, it is necessary to create a shared understanding of subjective evaluations of performance and of progression in the curriculum. In other words, it is better to stick to evaluation, rather than use numerical measurement, but attempt to establish comparability of standards through the professional judgements of a community of experts.

 

Key Study 3

Assessment and classroom learning

Black and Wiliam (1998)

This review article looks at the effects of formative assessment and argues that overall standards rise if assessment is used to identify students' learning needs. The authors looked at 600 research studies from all over the world, involving more than 10000 learners. These studies were carried out at all levels of the curriculum and across a range of very different subjects. The studies show that assessment that diagnoses students' difficulties and that provides specific and constructive feedback, leads to an improvement in learning. In a Portuguese study of 246 students with 25 teachers, students were given learning objectives and assessment criteria by their teachers, then asked to rate their own performance on a daily basis. The study showed that students in the experimental group progressed twice as fast as those in the control group.

Black and Wiliam conclude that it is the quality of the feedback given by teachers that is one of the most crucial factors if formative assessment is to be a success. Their survey indicates that there are five factors that are crucial for success, and five that are detrimental.

Factors crucial for success:

  • Regular classroom testing (to improve learning and teaching, not for competitive use)
  • Clear, meaningful feedback
  • The active involvement of all the students
  • Careful attention to the levels of self esteem and motivation of each student
  • self-assessment by the students (both in groups and with the teacher).

Factors limiting success include:

  • tests that are superficial and encourage learning by rote
  • failure of teachers to review testing procedures with each other
  • over-emphasis on marks and grades at the expense of meaningful advice
  • too much emphasis on competition between students (instead of focusing on personal improvement)
  • feedback, testing and record-keeping which is for managerial purposes rather than learning purposes, e.g. SATs

The problem that the authors identify in the UK is that although formative assessment is recognised as important, the education system is very much geared towards summative assessment (SATs, GCSEs and A levels, for example).

 

Evaluation point 1 – Generalisation -The authors looked at 600 research studies from all over the world, involving more than 10000 learners. These studies were carried out at all levels of the curriculum and across a range of very different subjects.

 

Evaluation point 2 – Usefulness – teachers can use these results.

 

Aim – To find out what makes for good formative assessment.

 

Key Study 4

Valencia, S. W., 1997, 'Authentic classroom assessment of early reading: alternatives to standardized tests', Preventing School Failure, 41, 2, 63-4

 

Aim - To provide evidence that standardized tests are not the only method of measuring reading ability in pupils in early years.

Type - A report of three case studies.

Participants - 'Rebecca', a first-grade pupil; 'Laura', a fifth-grade student with special educational needs; and 'Elizabeth', a third-grade student who had been assessed in every grade, from grades 1 to 3. All were pupils at a primary school in the USA.

Authentic classroom assessment of early reading involves using activities that are similar to the everyday activities used in reading: students interact with real books, engage in meaningful discussion about them, write about what they have read and set their own goals. These activities involve the application of strategies and skills in many different reading contexts, as opposed to the isolated testing of such skills that is carried out via standardized assessment. Students are observed in a number of reading situations and the teacher is able to compile a profile of the reading skills and strategies they display and place them in context (i.e. which skills and strategies they use in which situation).

Rebecca had not yet started to read and so was tested for emergent reading skills, including such items as knowing the right way up for a book, knowing that books are read from left to right and front to back, being able to differentiate between letter, word and sentence and understanding the sequence of a story (beginning, middle and end). Three books suitable for emergent readers were placed in front of Rebecca and she was asked to pick one to read with the researcher. Her initial behaviour with the book was observed and recorded. As the researcher read the book with Rebecca, the latter was asked questions to elicit whether or not she understood what was happening in the story and what the pictures in the book showed. She was also encouraged to 'read' predictable, repetitive passages, along with the researcher. Rebecca was assessed on three different occasions.

Laura was tested for both emergent and beginning reading skills. The difference between these skill sets is, essentially, the ability to identify letters, words and sentences. Laura was asked to read aloud and observations of her reading behaviours were made.

Elizabeth was tested at least three times a year for three years on a range of reading behaviours; the books she was tested on ranged from easy to difficult for her age group.  She was asked to read alone and to write down or draw pictures to represent what she had read; she also engaged in shared reading and questioning sessions with her teacher and with fellow pupils. A standardized form was used on all occasions to record the data about her reading ability. Elizabeth was also asked to self-evaluate her reading ability.

Case study 1 - Rebecca knew how to orient a book, move through it sequentially and use pictures to predict what was happening in the story. She was unable to read the print, but questioning revealed that she understood that print is read from left to right and from the top to the bottom of the page, the title of the book is found on the front cover and print, rather than pictures, is used to convey the story. She was unable, however, to identify letters, words and sentences.

Case study 2 - Laura showed the same range of emergent reading skills as Rebecca, but was also able to identify some letters and words. She was able to read aloud to a certain degree, but often used pictures rather than text to tell the story. Additionally, she often skipped words she found difficult, suggesting that she had not yet developed such strategies as using phonics to deal with unknown words. Laura also substituted words, this substitution being related to difficulties with recognizing vowels and word endings (e.g. she substituted 'worm' for 'warm' and 'looked' for 'learned'), and this hindered her ability to retell the story after reading it. When a story was read to her, however, she was able to retell it with little difficulty. This suggests that she was not reading for meaning, rather than lacking the ability to understand story structure.

Case study 3 - By the end of the first year, Elizabeth had high levels of listening comprehension, was able to summarize accurately many incidents from a story and her self-evaluation of her reading ability matched that of her teachers. By the end of the year she was able to read appropriately targeted books entirely on her own. By the end of the second year, Elizabeth was able to read age-appropriate books alone, self-correct her reading errors and read with expression.

Evaluation point 1 – Ecological validity – The testing involved using the child’s usual reading routine rather than artificial reading tests.

Evaluation point 2 – Rich data – The method gives an in-depth understanding of each child, allowing formative assessment that is useful for suggesting corrective techniques.

Evaluation point 3 – Generalisation – The method worked for these three girls in one American school. Other children might not respond as well, although this seems unlikely. In some cultures, however, reading with an adult might inhibit a child's responses.

 

Key Study 5

Rosenthal & Jacobson (1966)

Teachers' expectancies: determinants of pupils' IQ gains

Field experiment.

It had previously been demonstrated that experimenters could influence the results of their experiments. This is known as "experimenter bias". It was thought that this effect might be one instance of a broader phenomenon, which has become known as the "self-fulfilling prophecy". This means that when a person is labelled as being a particular type of person, that person will often change their behaviour to fit the label. For example, if a teacher calls a pupil "the class clown", the pupil could display more clownish behaviour.

 

In this field experiment, all the children in a primary school in America were used as subjects. The school had six grades, which are equivalent to the six years of a British primary school. Each grade, or year, was split into three streams (above average, average, and below average).

 

The experimenters told the teachers at the school that they were going to administer an intelligence test that would determine which children would be academic "bloomers". These children would stand the greatest chance of becoming academically bright in the future. Flanagan's Tests of General Ability (TOGA; Flanagan, 1960) was administered to all of the children. 20% of the children in each of the 18 classes were chosen at random and labelled as bloomers. Their classroom teachers were told that these children were bloomers and therefore stood a good chance of becoming quite academic, when in fact, on average, the children would have been no different in academic ability from the rest of their classmates.
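The random labelling step can be sketched in a few lines of Python. This is only an illustration of choosing 20% of each class at random: the class roster, pupil names, class size of 20 and the function name `pick_bloomers` are all hypothetical and are not taken from the study.

```python
import random


def pick_bloomers(classes, fraction=0.2, seed=None):
    """Randomly label a fixed fraction of the pupils in every class as 'bloomers'."""
    rng = random.Random(seed)
    labelled = {}
    for class_name, pupils in classes.items():
        # Selection ignores test scores entirely: it is purely random.
        n = round(len(pupils) * fraction)
        labelled[class_name] = set(rng.sample(pupils, n))
    return labelled


# Hypothetical roster: 6 grades x 3 streams = 18 classes of 20 pupils each.
classes = {
    f"grade{g}-{stream}": [f"grade{g}-{stream}-pupil{i}" for i in range(20)]
    for g in range(1, 7)
    for stream in ("above", "average", "below")
}

bloomers = pick_bloomers(classes, seed=1)
```

Because the choice is made within each class, every stream and every grade contributes the same proportion of "bloomers", so any later difference in IQ gain cannot be put down to the labelled children coming from brighter classes.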

 

After eight months the test was administered again to all of the children and the IQ gains were calculated. To check for experimenter bias a blind judge, or independent researcher, without knowledge of which children had been labelled as bloomers, tested some of the children for a third time.

 

It was found that the children who had been labelled bloomers had significantly higher gains in IQ (p = .02, one-tailed). The greatest gains were seen in the youngest children, those in grades one and two.

IQ gain         Control subjects (%)   Experimental subjects (%)   χ²     p
Ten points      49                     79                          4.75   .02
Twenty points   19                     47                          5.59   .01
Thirty points   5                      21                          3.47   .04
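As a rough check on the figures above, the sketch below converts each test statistic (4.75, 5.59, 3.47) into a one-tailed p-value. It assumes that the statistic is a chi-square value with one degree of freedom and that the one-tailed p is half the two-tailed p; the notes state neither, so treat both as assumptions.

```python
import math


def one_tailed_p(chi_square):
    """One-tailed p for a chi-square statistic, ASSUMING 1 degree of freedom."""
    # With 1 df, chi-square is the square of a standard normal z,
    # so the two-tailed p is erfc(z / sqrt(2)); halve it for one tail.
    z = math.sqrt(chi_square)
    return 0.5 * math.erfc(z / math.sqrt(2))


# (statistic, reported p) for each row of the table.
reported = {
    "ten points": (4.75, 0.02),
    "twenty points": (5.59, 0.01),
    "thirty points": (3.47, 0.04),
}

for gain, (statistic, p_reported) in reported.items():
    p_exact = one_tailed_p(statistic)
    print(f"{gain}: computed p = {p_exact:.4f}, reported p = {p_reported}")
```

Under these assumptions each computed value comes out slightly below the reported one, which is what would be expected if the reported figures are upper bounds read from a statistical table rather than exact values.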

There have been several criticisms of this experiment. The IQ test had not been standardised for the age range of children used, which means it may not have been valid for these children, particularly the youngest. Teachers may not have taken much notice of the list of children labelled as bloomers. Attempts to replicate this study have not been very successful.

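Evaluation

1. Ecological validity – It is a field experiment carried out in the children's own school, and the children are comfortable with the tests because they are used to being tested.

2. Ethics – Some children are disadvantaged by not being labelled as bloomers.

3. Large sample – Good; the results should generalise.

4. Wrong test – The test was not standardised for this age range, so scores overall might be lower, but this does not explain the difference found between the groups.

5. Blind checking – An independent experimenter, blind to the labels, checked the reliability of the second test by testing some children again.

6. Replication – Attempts to replicate the study have not been very successful.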

Compare to Seaver (1973), which is a natural experiment. Provided the sample is not biased, a natural experiment leaves little room for experimenter bias.

 

A natural experiment by Seaver (1973) found some evidence to support the above results. In Seaver's experiment, brothers and sisters who were taught by the same teachers were compared with brothers and sisters taught by different teachers. If the self-fulfilling prophecy is true, then teachers who had taught bright older brothers or sisters might well expect the younger brothers and sisters to be bright as well. This was indeed found to be the case, and similar findings emerged for less bright siblings. The siblings taught by different teachers served as a control for any possible genetic link, and they did not display similar levels of attainment.

 

Other web-pages

US backs quest for brightest children Guardian 06-06-00

a) frying pan or b) fire? Guardian 06-06-00

Melanie Phillips on intelligence testing