Performance Assessment for Science Teachers
Self-Test on Formative vs. Summative Evaluation
The role of "objectives," & "intended learning outcomes"
The role of inference in evaluation
Type of evaluation items
Self-Test on Test-Item Determination
Constructing objectively scored items
Self-Test on Objective Item Construction

This background section covers several topics which you must understand if you are to accomplish the goals of the course.
The decision-making sequence should be as follows: The teacher has an instructional problem to solve. He selects the appropriate form of evaluation. Then he develops a measurement instrument or procedure. Using the measure, the teacher collects the desired information. Using the information, the teacher solves the instructional problem.
If pupil learning is to be measured, then specifying instructional objectives and material to be learned is the first order of business. When intended learning outcomes have been identified, the kind of evaluation which is most appropriate can be planned and used. In evaluating student ability to perform in the laboratory or in front of the class, for instance, one would select a checklist or a rating scale rather than an objective paper-and-pencil test. The question, "Is this measure or kind of evaluation the best method for determining what I need to know about my student or my class to serve my specific purpose?" must be asked and answered. The debate over which is the best kind of test item or the best kind of test can only be resolved as you, the teacher, carefully consider the purpose which the test will serve.
You, as a teacher, need to know the meaning of two terms, measurement and evaluation, so that you can understand significant concepts to be taught later.
Measurement as used by teachers is a process of collecting information about the performance of a student or a class. It is a descriptive process. It answers such questions as "Can Mario measure liquid volume accurately?" "Which students can operate a microscope?" "Who can define which words?" "How many of 50 objective questions on elementary cell theory did Susan answer correctly?" "Can Karl communicate his thoughts clearly in an oral report?"
Measurement often includes the assignment of a number to express in quantitative terms the degree to which a pupil possesses a given characteristic. For instance, we measure Karl's ability to communicate, and then we record that he earned 60 of 100 points on a scoring sheet; we observe John and note that he can place an organism on a slide, locate it, and bring it into clear focus using high- and low-power; we test Susan to learn that she can match 70 of the words with their definitions; etc. In each case, we have measured a student's performance and reported it in numerical or descriptive terms. Such quantification tends to increase objectivity of the description so that it will have the same meaning from time to time and from person to person.
Measurement is not an end in itself. It does not imply judgments concerning the worth or value of the behavior being measured.
One of the most common tools of measurement used by teachers is the paper-and-pencil test. It measures many kinds of performance well. It is obviously not the only tool, however. A graduated cylinder is the tool to measure Mario's ability; a checklist or rating scale would help measure ability to communicate orally. Scales, cameras, tally sheets, anecdotal records and many more tools are used to collect information about (measure) student performance.
When a teacher makes value judgments about pupils' performance, the teacher is doing more than measuring: he or she is using measurement data to evaluate. All teachers evaluate pupils. Evaluation takes place when a teacher determines which students have satisfactorily completed a science course and which ones have not, when the teacher finds that John can operate the microscope better than anyone in the class, and when teachers decide which students are eligible for participation in interschool competition and which students are not. Evaluation occurs when teachers and parents compare a child's potential with his or her performance; it takes place when teachers praise and encourage students. In our schools, evaluation is inescapable.
You may have noticed that in each example of evaluation above, a comparison was made. The performance of students in the science course was measured. That performance was then compared to the minimum requirements for passing the class: those who met or exceeded the requirements passed. Students' qualifications or behavior were compared with the requirements, and some students were found eligible to participate in interschool competition; the child's performance was measured and then compared with his potential. Evaluation, then, is a process of comparing a student's performance or characteristics against a standard...
A student's performance may be compared with the performance of other students (normative evaluation) as in the case of John above--he can operate a microscope better than anyone else in the class; or a student's performance may be compared with a predetermined standard (criterion evaluation) as in the case of determining which students are eligible for interschool competition. Deciding that Ann's spelling score of 70% earns her an A (any score of 65 to 80 is an A in this teacher's class) is another example of criterion evaluation because the teacher compared Ann's score with the standard she had set in advance for A's, B's, C's, etc.
Although evaluations in education do not necessarily involve measurements, the usual purpose of measuring is to provide data that may be used in the evaluative process.
Measurement:
- A process of collecting information.
- Describes pupil performance or characteristics.
- Usually expressed in quantitative terms.
- Provides information for decision making.

Evaluation:
- A process of comparing.
- Students are compared with other students or with a predetermined standard.
- Judgmental--valuing--decision making.
- Used to rank students or compare them to a standard.
Teachers have many reasons for evaluating students. Some of the reasons are not defensible and will not be discussed. Most of the main reasons are considered in this section. They are classified as either primary or secondary reasons. You should know the difference.
Primary reasons for evaluating pupils are those reasons which are an essential part of a teacher's main responsibility--helping students improve in knowledge and skills, feelings and attitudes; helping students learn.
Secondary reasons for evaluating pupils are those reasons which are not central to the teacher's responsibility to help students learn but which are often met through evaluation. The needs of others involved in education--parents of the students, administrators, taxpayers, etc.--are met through evaluation, but for this course, these are secondary.
Directions: Choose the best answer and write it on a separate sheet of paper. Do not turn in this test.
1. Which of the following is a primary use of evaluation for public school teachers?
   a. Evaluating ongoing programs being used in a school or district.
   b. Research on teaching methods so as to teach a future class.
   c. Improvement of learning of students.
   d. Determining student growth so as to rank them for grading purposes.
   e. Determining student growth so as to place them in special groups.
2. A school should administer many different types of tests because
   a. Developmentally, children change as they grow and mature.
   b. Parents will object if not enough tests are given.
   c. No one test can measure all the varied facets of a child's ability.
   d. Of educational and administrative needs.
   e. The more tests given, the more students will learn.
3. I decide to use a criterion-referenced test in my mathematics class. This means my students' scores will be compared with
   a. The number of items I have set as the minimum acceptable number correct.
   b. The average score made by other students in their class.
   c. The distribution of scores made by other students to establish a student's rank.
   d. Their previous performance on similar tests.
   e. The average score made by all students who took the test.
4. Which of the following is NOT a primary use of evaluation for public school teachers?
   a. As a learning activity.
   b. As a basis for assigning grades to students.
   c. To improve instruction.
   d. To determine content mastery.
   e. To establish criteria or standards for a course.
5. Which one of the following is NOT a general principle of evaluation?
   a. Determining what is to be evaluated has priority.
   b. Evaluation techniques should relate to purpose served.
   c. Comprehensive evaluation requires a variety of techniques.
   d. Evaluation is no better than the learning activities provided.
   e. Evaluation is a means, not an end.
6. Which of the following would be considered measurement rather than evaluation?
   a. Bill finally made first chair in the school orchestra.
   b. John's study habits are ineffective.
   c. Mary got 70% correct on the spelling test.
   d. Sam earned a "C" grade with his average of 89%.
   e. Joe is the only one in the class who reads eighth grade books.
7. Collecting data on pupils is best defined as:
   a. Evaluation.
   b. Reliability.
   c. Norm-referenced.
   d. Measurement.
   e. Validity.
8. What do we mean by "Measurement should not imply judgment concerning the worth or value of the behavior being tested"?
Answers: 1: c, 2: c, 3: a, 4: b, 5: d, 6: c, 7: d, 8: We are suggesting that measures provide raw data; that a judgment involves a comparison of the data with something--criteria, other pupils, or past performance.
The validity of a test may be defined as the degree to which a test measures what it is supposed to measure. Since validity is a matter of degree, it is incorrect to say that a test is either valid or invalid. All tests have some degree of validity for any purpose for which they are used; however, some are much more valid than others.
Although there are several types of validity, classroom teachers are most concerned with the type known as content validity. Content validity is the extent to which the test or test items are an accurate sample of the total subject-matter content. Content validity relies heavily on the preparation of good instructional objectives to define the subject-matter to be learned. Properly written objectives can serve as a guide to the construction of valid test items.
If a test looks like it measures what it claims to, the test is said to have face validity. A test with face validity may or may not actually produce data which will correspond to the learning objectives. For example, an objective may call for a student to classify four types of leaves. At first, the item may seem valid; it has high face validity. Closer attention will show that the objective requires higher levels of thinking to classify and describe each leaf. The test then does not in fact measure what it purports to, and can be said to have a low degree of content validity. No chemistry teacher would think of measuring knowledge of analytical chemistry with a test on acid rain. Nor would a biology teacher think seriously about measuring microscope skills with an essay test.
The following illustration may help you to understand validity.
If your test were 40% valid we could represent the test like this:
One part of the test would be measuring your course content and one part would be measuring something else. Ideally, the two would overlap completely, like this.
If teacher-made tests are 80% valid they can be used to help students increase their learning.
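The overlap idea can be sketched numerically. The short example below is only an illustration under a strong assumption: that each test item can be tagged with the one objective it measures, and that the share of items aimed at taught objectives stands in for the "percent valid" figure. The function name, the objective labels, and the ten items are all invented for the sketch.

```python
def content_overlap(item_objectives, course_objectives):
    """Return the fraction of test items that target a taught objective.

    item_objectives: one entry per test item naming the objective it measures
                     (None if the item measures nothing taught in the course).
    course_objectives: the set of objectives actually taught.
    """
    if not item_objectives:
        raise ValueError("the test has no items")
    on_target = sum(1 for obj in item_objectives if obj in course_objectives)
    return on_target / len(item_objectives)


# Hypothetical 10-item test: 4 items match taught objectives, 6 do not.
taught = {"measure volume", "focus microscope", "define cell terms"}
items = [
    "measure volume", "focus microscope", "define cell terms", "measure volume",
    "spelling", "handwriting", "reading speed", None, None, "trivia",
]
overlap = content_overlap(items, taught)
print(f"{overlap:.0%} of items measure course content")
```

A count like this is only a rough check on content coverage; it says nothing about whether each item is well written.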
There are many factors which may make test results invalid for their intended use.
Teachers may be able to improve classroom tests' validity by using the list below as a guide to test construction or as a checklist to review a test or a testing situation.
Reliability refers to the consistency of test results. If a test gives the same results when measuring an individual or group on two different occasions then the scores are reliable. If different teachers rate the same essay, for example, on the same criteria and obtain the same score then we say the scores are reliable from one rater to another. In both cases we are interested in consistency or trustworthiness. In a simple example you might consider the task of measuring the length of a room.
There are several instruments you could use. You could step off the length on two different occasions, or two people could step it off. Another way would be to use a large rubber band with marks on it at one-foot intervals. Again you could measure several times, or several people could use the rubber band. A third choice might be to use a steel measuring tape. The tape will obviously give the more consistent, more trustworthy results--from time to time and from measurer to measurer. Unless the measurement can be shown to be reasonably consistent over different occasions or over different samples of the same behavior, little confidence can be placed in the results. The results are not reliable.
Reliability is an important consideration. It would be helpful if several teachers, each reading a book report, would give it the same score; otherwise, how can the score or the feedback notes to the students be trusted? It would be desirable if we could be sure that a test provided reliable scores of several samples of the same behavior or of a class' behavior over a given period of time.
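The room-measuring example can be made concrete with a small sketch. The readings below are invented for illustration, and the spread (largest reading minus smallest) is just one simple way to express consistency; it is not a formal reliability coefficient.

```python
from statistics import mean


def spread(measurements):
    """Range of repeated measurements: a smaller value means more consistency."""
    return max(measurements) - min(measurements)


# Invented readings (in feet) of the same room taken on different occasions.
readings = {
    "pacing": [19.0, 22.0, 20.5, 23.0],
    "rubber band": [20.0, 21.5, 19.5, 21.0],
    "steel tape": [20.5, 20.5, 20.6, 20.5],
}

for tool, values in readings.items():
    print(f"{tool:12s} mean={mean(values):.2f} ft  spread={spread(values):.2f} ft")

# The tool whose repeated readings agree most closely is the most reliable.
most_consistent = min(readings, key=lambda tool: spread(readings[tool]))
print("most consistent:", most_consistent)
```

The same comparison applies to scores: several raters scoring one book report play the role of several people measuring one room.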
Although reliability is a desired quality, it provides no assurance that evaluation will yield the information you need. Little is gained if measures, or tests, consistently give the wrong information. Refer to our example with the steel tape measure. As reliable as it is, it is not a valid measure of room temperature. Of the two qualities, reliability and validity, validity is the more important.
There are three factors which influence the reliability of a test. They are the objectivity of scoring, the range of item difficulty, and the length of the test.
Scores for objective test items are less subject to the opinions or values of the scorers and are thereby more reliable. In essay testing, when relying on observations of students' performance, or when rating the products of their work, scores tend to be unreliable. Later you will learn of some ways to increase reliability in such situations.
With norm-referenced tests there must be a range of difficulty. Tests which are too easy or too difficult tend to be less reliable.
Length is a third factor affecting reliability. Generally speaking, the longer the test, the more reliable it is. More specifically, the more items testing an idea or a skill, the more reliable the test score will be. This is because more items reduce the chance that guessing will affect the score.
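One common way to estimate the effect of length is the Spearman-Brown formula, which is brought in here as an illustration and is not part of the original text. It predicts the reliability of a lengthened test as k·r / (1 + (k − 1)·r), where r is the current reliability and k is the factor by which the test is lengthened, assuming the added items are comparable to the existing ones. The starting value of 0.50 below is hypothetical.

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability when a test is lengthened by `length_factor`.

    Assumes the added items are similar in quality to the existing items.
    """
    if not 0 <= reliability <= 1:
        raise ValueError("reliability must be between 0 and 1")
    if length_factor <= 0:
        raise ValueError("length_factor must be positive")
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


# Hypothetical quiz with reliability 0.50, kept as is, doubled, and tripled.
for factor in (1, 2, 3):
    print(factor, round(spearman_brown(0.50, factor), 2))
```

Under these assumptions, doubling the quiz raises the estimate from 0.50 to about 0.67, and tripling it raises the estimate to 0.75. The gains shrink as the test grows, so length helps but is not a cure-all.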
Directions: Match the terms with their uses or definitions. Terms may be used once, more than once, or not at all.
Answers: 1: b, 2: c, 3: b, 4: a, 5: a (in this case the test would be invalid), 6a: ambiguous objectives, few items, wrong kind of items, 6b: not enough items, items too easy or too difficult, ambiguous objectives, wrong kind of items, 7: train the scorers or prepare more objectively scored items, increase the range of difficulty of items, add more valid items.
Given examples of evaluation, correctly classify the examples as norm- or criterion-referenced evaluation.
List some appropriate ways to use both norm- and criterion-referenced evaluation when teaching your students.
After students have been instructed, the usual procedure is to test. Evaluation is a comparison of the student's performance with a standard by means of a measuring instrument. There are two types of standards used in education: norms and criteria.
Norm-referenced evaluation is evaluation based on a comparison of a student's performance with the performance of one or more other students on the same test.
Example 1: John received 70% on a physics test. The average score of John's class on the same test was 90%. Thus, John's level of performance was lower than the class average with which it was compared; as a result, he received a "D."
Example 2: In each of Mr. Green's classes, students are graded on a curve: 10% of the students taking a test receive "A's," 15% receive "B's," 50% receive "C's," 15% receive "D's," and 10% receive "F's." The students are compared with each other in terms of percentages. In a class of 20, the two students with the highest scores always receive A's, the next three always receive B's, the students with the middle ten scores receive C's, the three below them receive D's, and the bottom two receive F's.
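Grading on a curve like Mr. Green's can be sketched in a few lines. This is a minimal illustration, not a recommended grading tool: the student names and scores are invented, and tied scores are simply left in their original order. Applying the stated percentages (10/15/50/15/10) to a class of 20 gives 2 A's, 3 B's, 10 C's, 3 D's, and 2 F's.

```python
from collections import Counter

# Mr. Green's curve as (grade, percent of class) pairs, best grade first.
CURVE = (("A", 10), ("B", 15), ("C", 50), ("D", 15), ("F", 10))


def grade_on_curve(scores, curve=CURVE):
    """Rank students by score and hand out grades by percentage of the class.

    scores: dict mapping student name -> test score.
    Tied scores keep their original order (a simplification).
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    n = len(ranked)
    grades, start, cumulative = {}, 0, 0
    for letter, percent in curve:
        cumulative += percent
        end = round(cumulative * n / 100)  # rank position where this grade stops
        for student in ranked[start:end]:
            grades[student] = letter
        start = end
    return grades


# Hypothetical class of 20 with distinct scores (names and scores are invented).
scores = {f"S{i:02d}": 95 - 3 * i for i in range(20)}
grades = grade_on_curve(scores)
print(Counter(grades.values()))
```

Notice that the grades depend only on rank: raising or lowering every score by the same amount would leave every grade unchanged.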
Question 1: Think of and list at least three teaching situations in which you would need to rank order the students from highest to lowest on some skill or ability in a class you teach. (Do not use "determine grades" as one of the three.)
Criterion-referenced evaluation is evaluation based on a comparison of a student's performance with some preset performance standard which is determined independently of the test, or test scores.
Example: John received 70% on a physics test. In Mr. Atwood's class a score of 70% is always equal to a grade of "C." John's score was compared with a standard which is based on criteria which Mr. Atwood established apart from other students' scores on the same test.
Example: In Mr. Green's class, those students who achieve 70% of the course objectives receive "A's," those who achieve 60% receive "B's," those who achieve 50% receive "D's," and those students who achieve 30% or below receive "F's." Here the students' grades are determined by the number of objectives each student completes.
Example 1: Jim took a test before his fellow students had a chance to take the same test. He could not receive a grade until all the scores were collected and a class average calculated. This is an example of norm-referenced evaluation because the basis of evaluation for Jim's performance is the average of his fellow students' test scores.
Example 2: Susan earned 22 of 35 points on a test to measure liquid and solid volume. On the same day, her classmates' scores ranged from 21 to 7. Because Susan's score was the best in her class, she received an "A" grade. This is an example of norm-referenced evaluation because the basis for evaluating Susan's score was the comparison with the other members of her class.
Example 3: Susan earned 22 of 35 points on a test to measure liquid and solid volume. On the same day, her classmates' scores ranged from 21 to 7. According to standards set by the teacher at the beginning of the year, any score above 20 earns an "A." Thus Susan received an "A." This is an example of criterion-referenced evaluation because Susan's score was compared to a standard which was independent of her class's average. Susan was not compared to other students' performance.
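The difference between Examples 2 and 3 can be shown with the same score evaluated two ways. The sketch below is illustrative only: the function names are made up, and the individual classmate scores are invented values within the stated range of 7 to 21.

```python
def norm_referenced_top(score, classmates):
    """True if the score is higher than every classmate's score."""
    return all(score > other for other in classmates)


def criterion_referenced(score, cutoff):
    """True if the score exceeds a cutoff fixed before the test was given."""
    return score > cutoff


susan = 22
classmates = [21, 18, 15, 12, 7]  # invented scores within the stated 7-21 range

# Example 2: Susan earns an "A" because she outscored everyone else.
print(norm_referenced_top(susan, classmates))
# Example 3: Susan earns an "A" because 22 is above the preset standard of 20.
print(criterion_referenced(susan, cutoff=20))

# If one classmate had scored 30 (hypothetical), the norm-referenced result
# would change, but the criterion-referenced result would not.
print(norm_referenced_top(susan, classmates + [30]))
print(criterion_referenced(susan, cutoff=20))
```

The score is the same measurement in every call; only the standard of comparison changes.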
Example 4: Mr. Mechan published the following levels of achievement before a test was given to his eighth grade class:
- top 3% of students in class would receive A
- next 20% of students in class would receive B
- next 50% of students in class would receive C
- next 20% of students in class would receive D
- last 5% of students in class would receive F

This is an example of norm-referenced evaluation because the standard of comparison was student with student.
Example 5: Mr. Mechan published the following levels of achievement before a test was given to his eighth grade science class:
- 100% - 90% of test items correct: A
- 89% - 80% of test items correct: B
- 79% - 70% of test items correct: C
- 69% - 60% of test items correct: D
- 59% - 50% of test items correct: F

This is an example of criterion-referenced evaluation because the basis of comparison is a preset standard: the percentage correct on the test. The comparison was not with other students.
Example 6: Seventh grade students cannot receive an "A" if they miss more than one laboratory report in a term. This is an example of criterion-referenced evaluation because the standard for an "A" is missing no more than one lab report. The comparison is not with any other students.
There are classroom situations in which normative evaluation should be used. Most standardized tests are norm-referenced so that one can compare the performance of a pupil or a class or school with the performances of other similar students. Such comparisons are often necessary in making program decisions regarding curriculum. Normative comparison through standardized tests may also be required in doing school related research.
Our present school system, especially the secondary schools, sponsors some competitive programs which require students to be placed in rank order. Coaches of every subject are almost always expected to rank order their team members and select the best students to compete. The agricultural science teacher in selecting students for an animal judging team uses norm-referenced evaluation. Most awards--scholarships, honors, even the valedictorian--are selected with normative procedures. Whatever one thinks of these present procedures, as long as they exist, normative evaluation is necessary to rank-order students' ability.
Normative evaluation could be used appropriately by a teacher who is new to the subject matter in order to assign grades. Using a curve of some kind in grading ensures that some students will receive high marks regardless of ineffective instruction. The alternative, a criterion-grading system with unrealistically high standards, may ensure that no students receive high marks, when in fact students are learning most of what they are taught. Some teachers have developed a grading system which combines norm and criterion evaluation in an effort to get the best of both systems.
Usually when teachers are quite sure that their instruction is effective and that the students are typical, then criterion-referenced evaluation should be used. It has at least two advantages. First, it allows cooperation among the students rather than encouraging competition. Secondly, it tells all who are interested--teacher, students, and parents--how much of the material to be learned has in fact been learned. Without this information, teachers cannot adapt instruction, and students cannot focus effort where it is needed.
Directions: Place the letter of the illustrated form of evaluation (N, C, or X) in the blank space provided beside each situation.
N = Norm-referenced evaluation
C = Criterion-referenced evaluation
X = Not enough information to tell
- Top 15% of the students in class receive A
- Next 10% of the students in class receive B
- Next 50% of the students in class receive C
- Next 10% of the students in class receive D
- Last 15% of the students in class receive F
Answers: 1: N, 2: C, 3: N, 4: C, 5: C, 6: X, 7: N.
Given examples of evaluation, the student will classify them as either formative evaluation or summative evaluation with a score of 75%.
Evaluation used to aid in determining a student's competencies is essential in the instruction of each student. Evaluation should be continuously and frequently used during instruction.
Evaluation is formative when it measures how well students are learning, or how much of the subject matter they have mastered, in order to help them learn more or to help the teacher improve ongoing instruction.
Evaluation is summative when it tests students' performance to determine their final overall assimilation of course material, the overall effectiveness of the instructional method, or both.
To classify a particular example of an evaluation as either formative or summative, you must be able to place it on a continuum as being more like formative evaluation or more like summative evaluation. Each type of evaluation has certain characteristics, which are called ATTRIBUTES. On the following "Attribute Sheet," attributes are listed for each type of evaluation.
An example of one specific evaluation given by a specific teacher in a specific way at a specific time may have some formative attributes and some summative attributes. By knowing the attributes and being able to recognize those characteristics in an example, you can then place that example on the continuum.
Example: A science teacher gives a test that is part of the students' grade (summative attribute) after two days of instruction on only the first part of the total material contained in the unit (formative attribute), testing each idea specifically given in class (formative attribute). She uses the test to see if students have understood the ideas, repeating part of her lesson the following day on one particular idea that students did not understand (formative attribute).
This example has one summative attribute and three formative attributes. Therefore, the teacher's test is more like formative evaluation because of the material tested, the time it was given, and the purpose for which she used it.
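The tallying step in this example can be sketched as a small function. It is only an illustration of the counting rule used above: each attribute you recognize is tagged "F" or "S" by hand, and the function reports which kind there are more of. Returning "X" when no attributes are found and rejecting ties are choices made for this sketch.

```python
def classify(attributes):
    """Classify an evaluation from a list of 'F' and 'S' attribute tags.

    Returns 'F' (more formative attributes), 'S' (more summative attributes),
    or 'X' when no attributes of either kind were identified.
    """
    formative = attributes.count("F")
    summative = attributes.count("S")
    if formative == 0 and summative == 0:
        return "X"
    if formative == summative:
        raise ValueError("equal counts: look for further attributes")
    return "F" if formative > summative else "S"


# The science teacher's test above: part of the grade (S), given after two
# days on part of the unit (F), tests each idea (F), used to re-teach (F).
print(classify(["S", "F", "F", "F"]))
```

The judgment lies in recognizing the attributes; the counting itself is mechanical.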
| Attributes | Formative Evaluation | Summative Evaluation |
|---|---|---|
| Time | (1) Given at frequent intervals, usually after small amounts of instruction. | (1) Infrequently given throughout the year. Usually at the end of a large amount of instruction, i.e., unit tests, midterm exams, semester finals. |
| | (2) Pre- or during instruction. | (2) After instruction is finished. |
| Material tested | (3) Tests specific skills; concepts and/or enabling objectives. | (3) Tests general concepts, skills and/or terminal objectives. |
| | (4) Ideally tests every concept and/or objective which has been taught. | (4) Samples from among the concepts, skills, and/or objectives that were taught. |
| Purpose | (5) Determines specific skills, concepts, and objectives which student has/has not mastered, (diagnosis). | (5) Used to determine students' grades and report them. |
| | (6) Provides immediate feedback to students on their learning performance; often suggests learning activities. | (6) Used as a basis for subsequent revision or redesign of a course or program. |
| | (7) Predicts probable performance on successive skills, course goals, and summative evaluations. | (7) Predicts students' probable performance in subsequent courses. |
| | (8) Identifies specific weaknesses in ongoing instruction (material and teaching procedures), allowing the teacher to remedy the instruction. | (8) Determines program or course effectiveness. |
Example: Mr. Jones tests his class once a week on specific skills involved in dissecting frogs.
Rationale: Testing with such frequency would suggest that the teacher is determining the students' progress throughout the unit of instruction and after teaching only one or two concepts.
Example: After a three week unit on dissecting frogs is completed, Mr. Clark tests students to determine their knowledge of dissecting procedures.
Rationale: Evaluation here occurs after all instruction on frog dissection is completed.
Example: Mrs. Farrell gives her class a quiz after teaching about every two or three cell organelles.
Rationale: Evaluation occurs after instruction on each idea and measures every idea and skill taught.
Example: Students in Mr. Hart's human biology class must master the first aid technique, CPR, on a dummy before working on a real person.
Rationale: Practicing the skill on a dummy enables students to properly perform the terminal objective of practicing on a real person.
Example: The students in Miss Karren's Integrated Science class receive a test covering general characteristics of the earth's atmospheric systems.
Rationale: This evaluation tests general knowledge of a whole unit.
Example: Miss Peterson gives her Biology class a test on five of the eight organelles taught in the recently completed cell unit.
Rationale: Miss Peterson's test evaluates knowledge of only a sample of the organelles studied. Test occurs after all instruction on cells is finished.
Example: After rating students' measurement skills, Miss Lawrence notifies the students as to skills mastered or not mastered.
Rationale: Miss Lawrence's test rates and diagnoses specific skills which are parts of a more general proficiency.
Example: After most of Mrs. Dusek's students failed the classification skill test in her first unit, she reorganized some of her lesson and re-taught parts of it.
Rationale: These test results were used immediately to evaluate, diagnose, and alter instructional methods, as well as students' skills.
Example: After each learning activity in Miss Bennett's Chemistry class (including tests, quizzes, homework assignments, etc.), the students are informed exactly which concepts they learned or didn't learn, and what they must do to learn the non-mastered concepts.
Rationale: The test is used here to give students feedback and prescribe learning activities.
Example: The results from Mr. Campbell's science test on planning an experiment showed that most students did not know how to write hypotheses. Thus, they were not allowed to perform an experiment yet.
Rationale: The test was used here to predict students' performance or failure, in this case, on the next objective in the unit.
Example: Scores on Mr. Chalmer's weather test were used to help determine students' grades in the course.
Rationale: Mr. Chalmer's test was used as the basis for students' grades.
Example: After most of Mrs. Disek's students failed their semester final, she revised her lessons and rewrote her test for the class the following year.
Rationale: The test results were used to revise instructional methods for subsequent classes.
Example: Mr. Kool, a science teacher, gave his classes a quiz at the end of each field trip during his ecology unit in order to see if students gained any useful information on the trips. As a result of the quizzes, he eliminated some of the field trips the following year.
Rationale: Again, Mr. Kool's evaluation was used to revise instructional methods for subsequent courses.
Example: After reviewing the results of his biology students' final exams, Mr. Cutler selects 28 students to be in his advanced placement Biology class the following year.
Rationale: This is another example of predicting students' performance; here the prediction concerns students' ability to succeed in subsequent courses.
Directions: On the blank line write the letter of the type of evaluation (F, S, X) which the example is most like.
Hint: An example may have both Formative and Summative attributes, attributes of only one, or attributes of neither. If there are attributes of both, the answer in the left blank will be the letter of the type of evaluation for which there are the most attributes. There will never be an equal number of Formative and Summative attributes.
F = Formative Evaluation
S = Summative Evaluation
X = Neither
At the end of three days of instruction, Mrs. Liptaut evaluates her beginning science students on their mastery of microscope skills. F
Explanation: Due to the frequency of evaluation in relation to the amount of material taught and to the evaluation's purpose of assessing mastery, this is formative evaluation.
1. Mr. Page, a life science teacher, uses most of his testing to learn if students can synthesize science facts to explain the functioning of a cell. If students do not do well on a test, he re-teaches those concepts the very next day, in a different way. ____
2. Miss Fitt uses true/false questions to evaluate her students. ____
3. Mr. Brawn, the biology teacher, uses a matching question to evaluate students' knowledge of all the parts of a microscope. ____
4. To receive an "A" grade for chemistry, each student must perform a lab experiment by himself. ____
5. To revise the course content, Mr. Prove talks with students at the end of each unit, rather than giving them any formal evaluation. ____
6. After a unit of instruction, Ms. Take rates her students' dissecting ability, generally pointing out weak areas as they are demonstrated. ____
7. Mr. Lassez and Mrs. Faire's philosophy for evaluating their classes is that any grade would be subjective; therefore, the class is Pass/Fail. ____
8. For each unit, Mr. Jones outlines all required behaviors and objectives. He then evaluates his class on all the major, terminal skills and objectives. ____
9. Because so many concepts were taught during a climate unit, the unit test was given over two days in order to test each specific concept. This helped the teacher know if students knew the concepts sufficiently to use them in problem-solving questions on the final exam. ____
10. Mrs. Snodgrass gives her horticulture students daily quizzes on the lecture the day before. ____
11. Tests in your physics class measure every course objective and concept that you are expected to learn. ____
12. Your biology tests always focus on one of the main concepts of the unit. You assume that if students know one, they know them all. ____
Answers:
1: F Teacher uses test results to re-teach information not yet learned.
2: X Any kind of item may be used with either kind of evaluation.
3: X One question samples students' learning.
4: X We are not told how results are used or when test is given.
5: S Revision is made after the unit is completed.
6: F See number 1 above.
7: X Not enough information on how results are used.
8: S He tests just the big ideas.
9: F All material was tested and used to predict.
10: X Tested daily, but we don't know how test data are used.
11: F All material is tested.
12: S This is a sampling procedure.
Before we can determine the validity of a measure, or for that matter of a unit of instruction, we must know what the objectives are for the unit. Instructional objectives specify what students will be able to do when they have completed instruction. The following is an introduction to instructional objectives. When you finish, you should be able to write objectives for instruction in your subject major which tell three things:
1. What learners will be able to do when they are finished learning--expressed in terms of observable behavior.
2. How well they will be able to do it.
3. Under what conditions learners will be able to perform the specified behavior.
An instructional objective (or "intended learning outcome," as it is often called) is a desired goal of learning which is written in terms of observable behavior by the learners. A well-written objective specifies three things: 1) the observable behavior of students which indicates what they have learned, 2) important conditions under which students will show what they have learned, and 3) how well students should perform.
Observable behavior. It is difficult to know what students have actually learned in a lesson. The best we can do is infer what they have learned from their behavior. The objective, "help students understand plate tectonics" does not describe student behavior. Rather, it describes teacher plans--it describes what the teacher will do.
The statement, "Students will really understand plate tectonics" may be the real goal, but is too ambiguous to measure. It does identify student learning, but we have no clue as to how students will show what they know about movement of plates. Notice the verbs in the following statements:
1. Students will list the three forms of sexual reproduction.
2. Students will label the five main parts of a flower.
3. Students will balance 5 chemical equations.
4. Students will understand the importance of physics in their daily lives.
Which statements are measurable? If you said statements one through three, you are correct. The verbs "list," "label," and "balance" indicate observable student behavior. The verb "understand" in statement four gives us no clue as to how students will show what they know. Much more measurable would be "Students will list three ways that knowledge of physics may help them at home," or "Students will defend in a paragraph the reasons why all students should study physics."
Another example. Suppose that we are interested in having students, "know something about the invention of the cathode ray tube." Whatever may be intellectually involved in the attainment of this goal, it should be apparent that the language of our aim, as stated, leaves much to be desired.
What is the student who knows about cathode ray tubes able to do that the student who does not know, is not able to do? This is the important question because, until we have worked out a clear answer to it, we cannot measure the accomplishment of our instructional purpose. There is no single answer to the question we have posed; our objective of "knowing something" is too vague for that.
Having written a statement of student learning in terms of observable behavior, a teacher has addressed the most important part of an instructional objective, and many teachers stop here. Objective writers who add "conditions" to their objectives add details which will help when they plan how to measure student learning. Look at the objective on studying physics. The phrase "...defend in a paragraph..." helps us know more about what students will be expected to do to show that they understand the importance of physics. The statement clarifies how students will show their learning. Sometimes you will need to tell whether the task is open book or from memory, whether there is a time limit, or what materials the student will be provided for the test. Examine these two statements. Which lists important conditions?
a. The student will label the parts of a flower.
b. Given a list of major flower parts, the student will write the function of each.
Statement b lists conditions. It states that the students will tell the functions of flower parts that are selected by the teacher. Try your hand at adding conditions to statement a.
Did you write that students would be given a drawing of a typical flower or that students would be given the list of flower parts or allowed to use notes from a previous lecture? Any or all of these may have been included as conditions. The amount of time allowed for this task may not be important. If not, it isn't listed.
In the Science Core Curriculum for Utah you will note that the writers list "standards" and "objectives." "Standards" are very general course goals. They are not always written in terms of observable student behavior. The "objectives" are more specific. Verbs in the "objectives" usually indicate observable behavior, though the reader may not always know what the verb was intended to mean. See "Investigate" as an example. Neither the "standards" nor the "objectives" indicate conditions or criteria. In writing the science core, the writers deliberately left the decisions about conditions and criteria up to teachers and test writers because they are better able to decide what will work (conditions) and how well their students should be able to perform (criteria).
As you plan to assess student learning in your science class, note carefully the broad goals, standards, and more specific objectives in the core curriculum. Notice that they are moving you away from the measurement of trivia and toward more significant learning outcomes. This may be a good time to examine the overall Intended Learning Outcomes for Secondary Science in Utah. They indicate the most important goals for Utah science instruction. Also, you should be aware that there is a pool of test items available on the Internet which are prepared for your use in evaluating your students.
The purpose of all tests is to collect data which will help make judgments, or inferences, about what students know or can do. The task is to reduce the level of inference as much as possible. A high inference item is one which forces a teacher to draw conclusions about a student's ability to do a task from data that are taken from an unrelated task.
For instance, to assume that science students can write effective research reports because, on a series of test items, they have punctuated sentences well and spelled words correctly is obviously a high inference. A lower and safer inference about a student's ability to write reports would be made from examining several samples of the student's attempts to write various kinds of reports.
Another example: Science teachers may require students to learn the structure and functions of the anatomical parts of birds, reptiles, and amphibians. If the ability to name or label anatomical parts is measured on a test and a student does well on those items, a teacher could make a fairly safe inference about that student's ability to name and label parts. The teacher could not safely infer, however, that because a student can name and label parts of animals, the student can also contrast the life cycles of different phyla of animals or compare the digestive systems of selected animals. To make a safe inference about such ability, the teacher should design test items which require students to compare and contrast life cycles or digestive systems; the data from those items would then support more trustworthy inferences.
So the purpose of all test items is to provide either the student or the teacher with data from which inferences about the student's ability can be made. Paper and pencil tests in themselves are rather artificial measures of a student's ability to think, plan, analyze, or create. They are artificial because they take a small sample of the student's behavior, and the smaller the sample, the less representative it is likely to be of the general task for which the student is being trained. In measuring higher-level thinking skills, then, if one is to infer thinking ability from the student's ability to work items correctly, the items must represent the skill itself as completely and accurately as possible and must also sample from all of the parts of the skill being tested.
Another example: the best way to find out if the student can conduct research would be to put the student in a wide variety of problem solving situations, and to observe the student's research abilities. This is not possible in most cases. Usually, we must measure a whole group of students at a time. Also, we don't have the time or the resources to put students in a wide variety of situations and observe them as they act. Unable to do this, we write test items which come as close to that kind of a situation as possible.
Each type of evaluation item has its own unique (1) characteristics, (2) uses, (3) advantages, (4) limitations, and (5) rules for construction. In this section you will study the first four ways of looking at evaluation items. You should be able to select the kind of item that best fits your programs.
Types of traditional evaluation items may be categorized as follows:
Pupil response evaluations consist of paper-and-pencil test items that are used to measure knowledge. Teacher observation evaluations are used to measure actual student performance, desired personal/social behaviors, and products that may be observed by the teacher. Pupil report evaluations are used when a teacher's evaluation of behaviors is apt to be inadequate unless his/her observations are supplemented by reports from the individual student being evaluated or by judgments of that student's peers.
The objective test is highly structured and requires the student to supply a word or two, or to select the correct answer from among a limited number of alternatives.
The essay test permits the student to select, organize, and present answers, with either extended or restricted responses. Both essay and objective test items may go beyond testing recall and recognition of knowledge and both should be used to measure more complex learning outcomes, such as application, analysis, and synthesis of knowledge.
Objective test items can be classified into those which require the student to supply the answer and those which require the student to select the answer from a given number of alternatives. Selection items ask the student to recognize information and, frequently, to make discriminating and simple problem-solving decisions. Supply items usually require students to construct more complex responses.
True/false is the name generally used to refer to any alternative-response item consisting of a declarative statement which the student is asked to mark true/false, right/wrong, correct/incorrect, yes/no, fact/opinion, agree/disagree, etc. The most common use of the true/false item is to measure ability to identify the correctness of statements of fact, definitions of terms, statements of principles, etc.
Advantages attributed to true/false items are not, unfortunately, very valid. The advantage cited most frequently is ease of construction. This has probably resulted from the all-too-common practice of taking statements from textbooks and changing half of them to false statements.
There are many limitations of true/false items: (1) only the more elementary learning outcomes can be measured; (2) many true/false items are so obvious that everyone gets them correct or so ambiguous that even the better students are confused by them; (3) some subject matter does not lend itself to true/false items because statements cannot be made that are true or false without qualification or exception. Storey[1] has suggested that "in general, the true/false item as written and scored by most classroom teachers proves to be so lacking in validity and reliability, and so subject to response set and guessing on the part of examinees, that it serves little or no useful measurement purpose. . ." We agree and encourage you to avoid using them.
The matching item consists of two parallel columns with each word, sentence, or phrase in the left column being matched to a word(s), phrase, number, or symbol in the right column. A matching exercise seems most appropriate whenever the ability to identify the relationship between two things is being measured.
Advantages of the matching item include: (1) its compact form, which makes it possible to measure a large amount of related factual material in a short time; and (2) ease of construction, if the correct response for each premise can serve as a plausible response for the other premises.
Major limitations are: (1) factual information based on rote memorization is usually measured; (2) irrelevant clues are often present; and (3) significant material of a homogeneous nature is limited.
The multiple-choice item consists of a problem and a list of suggested solutions. The problem may be stated in the form of a direct question or an incomplete statement. From given alternatives, the student must select either the best answer or the correct answer. The best answer type is especially useful for measuring learning outcomes that require understanding, application, or interpretation of factual information. Two examples of multiple-choice items that measure understanding and application are given below:
1. Pascal's law can be used to explain the operation of:
(a) electric fans
(b) hydraulic brakes
(c) levers
(d) syringes
2. Which one of the following best illustrates the principle of capillarity?
(a) Fluid is carried through the stems of plants
(b) Food is manufactured in the leaves of plants
(c) The leaves of deciduous plants lose their green color in winter
(d) Plants give off moisture through their stomata
Advantages of multiple-choice items include: (1) they may easily measure knowledge as well as a variety of complex learning outcomes; (2) they avoid the ambiguity and vagueness often present in short-answer items; (3) they present the student with a greater number of alternatives than true/false items, increasing the item's reliability; (4) there is less chance for students to correctly guess the answer than with the true/false item; (5) it is possible to diagnose student misunderstandings by the nature of the incorrect alternatives selected; and (6) it is possible to test many ideas easily.
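The guessing advantage listed above can be made concrete with a little arithmetic. The following is only a sketch under stated assumptions: the student guesses blindly, every alternative is equally likely to be chosen, and the 50-item test length and four alternatives per item are illustrative values. The function names are hypothetical and chosen for this example.

```python
from fractions import Fraction


def chance_of_correct_guess(num_alternatives: int) -> Fraction:
    """Probability of answering one item correctly by a blind guess."""
    if num_alternatives < 2:
        raise ValueError("an item needs at least two alternatives")
    return Fraction(1, num_alternatives)


def expected_score_from_guessing(num_items: int, num_alternatives: int) -> Fraction:
    """Average number of items answered correctly by guessing alone."""
    return num_items * chance_of_correct_guess(num_alternatives)


# Hypothetical 50-item test answered entirely by guessing (assumed values).
true_false = expected_score_from_guessing(50, 2)       # two alternatives per item
multiple_choice = expected_score_from_guessing(50, 4)  # four alternatives per item

print(float(true_false))       # 25.0
print(float(multiple_choice))  # 12.5
```

Under these assumptions, guessing alone earns about half the items on a true/false test but only about a quarter of the items on a four-alternative multiple-choice test.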
Two limitations of multiple-choice items are true for all objective tests: (1) they measure whether the student knows or understands what to do in a problem situation, not how the student will actually perform; and (2) they require selection of a correct or best answer and therefore cannot measure some problem-solving skills in mathematics and science, nor the ability to organize and present ideas. A third limitation, not shared with other item types, is the difficulty of locating a sufficient number of incorrect but plausible alternatives.
Typically, a rating scale consists of a set of characteristics or qualities to be judged and some type of scale for indicating the degree to which each characteristic is present. The rating scale itself is merely a reporting device. Its value in appraising the student's performance or product depends upon the care with which it is prepared and the appropriateness with which it is used. If properly constructed and used, it serves several important evaluative functions. It (1) directs observation toward specific and clearly defined behaviors; (2) provides a common frame of reference for comparing all students on the same set of characteristics; and (3) provides a convenient method for recording judgments of observers.
A checklist is similar in appearance and use to the rating scale. The basic difference between them is in the type of judgment called for. A rating scale provides an opportunity to indicate the degree to which a characteristic is present or the frequency with which a behavior occurs. The checklist, on the other hand, calls for a simple "yes-no" judgment.
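For teachers who keep observation records electronically, the difference in judgment type might be sketched as follows. This is a hypothetical illustration, not a prescribed format: the class names and the `record` method are invented for this example, and the four-point scale and item wording are borrowed from the rating scale and checklist examples that appear later in this document.

```python
from dataclasses import dataclass


@dataclass
class RatingScaleItem:
    """One characteristic judged by degree on a numbered scale."""
    characteristic: str
    low_label: str
    high_label: str
    points: int = 4  # assumed four-point scale

    def record(self, rating: int) -> int:
        # A rating must fall on the scale, for example 1 through 4.
        if not 1 <= rating <= self.points:
            raise ValueError(f"rating must be between 1 and {self.points}")
        return rating


@dataclass
class ChecklistItem:
    """One behavior judged only as observed or not observed."""
    behavior: str

    def record(self, observed: bool) -> str:
        return "yes" if observed else "no"


punctuation = RatingScaleItem("Punctuation", "unclear", "very clear")
lays_bread = ChecklistItem("Lays bread on flat surface")

print(punctuation.record(3))    # a degree on the 1-4 scale
print(lays_bread.record(True))  # a simple yes-no judgment
```

The rating scale item accepts any point on its scale, while the checklist item accepts only whether the behavior was observed.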
The self-report provides valuable evidence concerning the students' perception of themselves and how they want others to view them. Also, it may be used to elicit information about student behavior which the teacher is unable to observe. Without student self-reports, a teacher cannot get a complete picture of students' adjustments, interests, attitudes, or activities. In general, two types of information may be profitably obtained by self-report techniques: (1) information about the students' out-of-class behaviors, such as the books they have read, hobbies, particular experiences, etc.; and (2) information about students' thoughts, such as worries and concerns, feelings toward self and others, interests, opinions, etc.
Peer reports are frequently used to give a student additional feedback from other students on a product or performance. In the areas of leadership ability, concern for others, effectiveness in group work, and social acceptability, students frequently know each other's strengths and weaknesses better than the teacher. Checklists and rating scales are techniques that may be used within self- and peer reports.
Directions: Place the letter which corresponds to the most appropriate evaluation method for each item listed in the blank beside the item.
Answers: 1: d, 2: a, 3: a, 4: b, 5: e, 6: d, 7: e, 8: c, 9: b, 10: c.
Poor: ___________________ discovered radium.
Poor: Radium was discovered by ___________________. (Poor because many answers may fit, e.g., "a Frenchman," "Curie," "Experimenting with pitchblende.")
Better: Who discovered radium? ___________________
Poor: ___________________ discovered radium.
Better: Who discovered radium? ___________________
Poor: ___________________ observed great diversity in ___________________ in the ___________________. (Huh?)
Good: Who was given credit for the early development of the theory of evolution? ___________________
Poor: When an animal eats plants, it is said to be an ________________.
Better: When an animal eats plants, it is said to be a/an ________________.
Poor: When did Marie Curie receive the Nobel Prize? ________________ ("A long time ago" is a correct answer!)
Better: In what year did Marie Curie receive the Nobel Prize? ___________________
Poor: Darwin ___________________ finches in the Galapagos ___________________.
Poor: Darwin studied ___________________ in the ___________________ islands.
Poor: Evolution:
Better: Who developed the theory of evolution?
Poor: In what year did Darwin visit the Galapagos?
Poor: Which of the following is not a mammal?
Better: Which of the following is NOT a mammal? OR All of the following are mammals EXCEPT:
Poor: A scientist who studies stars and galaxies is called an:
Better: A scientist who studies stars and galaxies is called a/an:
Poor: Albert Einstein was:
(Which of these alternatives is a wrong answer?)
Poor: A particle found in most atomic nuclei is referred to as a/an
Better: A neutron can best be defined as:
Poor: T F All of the mountains in the Rocky Mountains were formed by volcanic action.
Good: T F The mountains in the Rocky Mountains were formed by volcanic action.
Poor: T F Gregor Mendel was not a great astronomer.
Good: T F Gregor Mendel was NOT a great astronomer.
Poor: T F John Glenn, an American astronaut, was a skillful pilot and is best known for his first orbital flight around the earth.
Good: T F John Glenn is best known for his first orbital flight around the earth.
Poor: T F Slavery was a major cause of the Civil War whereas the economic situation of the southern states was not.
Good: T F Slavery was one of the major causes of the Civil War.
Poor: T F Gerald Ford did not win the 1976 presidential election because of the good weather in the east which brought many people to the polls on election day.
Good: T F One of the reasons Gerald Ford did not win the 1976 presidential election was the good weather in the east which brought many people to the polls on election day.
Poor: T F The economic situation of the southern states was a major cause of the Civil War.
Good: T F A major cause of the Civil War was the economic situation of the southern states.
Poor: T F According to most botanists, Mendel is considered to be the greatest botanist.
Good: T F According to Brett Moulding, Mendel is considered to be the greatest botanist.
Poor: Punctuation and writing style:
unclear 1 2 3 4 very clear
Better: Punctuation:
unclear 1 2 3 4 very clear
Writing Style:
unclear 1 2 3 4 very clear
Poor: How important is money to a happy marriage?
1
|
|
|
5
Better: How important is money to a happy marriage?
1 unimportant
2 of little importance
3 important
4 very important
5 the most important
Poor: How important is money to a happy marriage?
Better: How important is money to a happy marriage?
Poor: Interest in playing basketball
Better: How important is money to a happy marriage?
Directions: Check the space to the left of each behavior as that behavior is observed.
____ Has proper materials (2 slices of bread, peanut butter, knife) ready.
____ Lays bread on flat surface.
____ Opens peanut butter jar.
____ Grasps knife firmly by the handle.
____ Inserts blade of knife into peanut butter and removes the desired amount.
____ Has difficulty manipulating the knife.
____ Spreads peanut butter on one side of the first slice of bread with a side-to-side motion so that peanut butter is evenly distributed.
____ Places second slice of bread directly on top of peanut butter.
____ Closes peanut butter jar.
DIRECTIONS: In the following test-item examples, place the letter which corresponds to the major test-item construction fallacies in the box to the left of the test item. If there is no fallacy, use "f."
___1. In the simplest form of asexual reproduction, spore formation, the parent organism divides into two equal parts. (T or F)
___2. Which of the following vitamins is not a member of the vitamin B complex?
Completion
a. There should be no clues.
b. All blanks should be of uniform length.
c. The omitted phrase should be no longer than three words.
d. Words or phrases omitted should not be trivial words or phrases.
e. There should not be so many blanks that the context of the material is ambiguous.
f. No major fallacy occurs in this test item.
3. The "law of dominance" says that ______________ that occurs is ______________ are separated during ______________ .
4. Match the person with his contribution.