From: FLAIRS-01 Proceedings. Copyright © 2001, AAAI (www.aaai.org). All rights reserved.
A Quagmire of Terminology: Verification & Validation, Testing, and
Evaluation*
Valerie Barr
Department of Computer Science
Hofstra University
Hempstead, NY 11550
vbarr@hofstra.edu
* Copyright © 2001, American Association for Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract
Software engineering literature presents multiple definitions for the terms verification, validation and testing. The ensuing difficulties carry into research on the verification and validation (V&V) of intelligent systems. We explore both these areas and then address the additional terminology problems faced when attempting to carry out V&V work in a new domain such as natural language processing (NLP).

Introduction
Historically, verification and validation (V&V) researchers have labored under multiple definitions of key terms within the field. In addition, the terminology used by V&V researchers working with intelligent systems can differ from that used by software engineers and software testing researchers. As a result, many V&V research efforts must begin with a (re)definition of the terms that will be used. The need to establish working definitions becomes more pressing if we try to apply verification, validation, and testing (VV&T) theory and practice to fields in which developers do not normally carry out formal VV&T activities.

This paper starts with a review of terminology that is used in the software engineering/software testing areas. It then discusses the terminology issues that exist among V&V researchers in the intelligent systems community and between them and the software engineering/software testing communities. Finally, it explores the terminology issues that can arise when we attempt to apply VV&T to other domain areas, such as natural language processing systems.

Terminology Conflicts - First View
The first term to tackle in the terminology of software testing is the term testing itself. Unfortunately this word is used to refer to several activities that take place at very different levels in the software development process. In one usage, the term refers to testing in the small, the exercise of program code with test cases, with a goal of uncovering faults in code by exposing failures. In another usage, the term refers to testing in the large, the entire overall process of verification, validation, and quality analysis and assurance.

The term V&V, for verification and validation, is also used in both high level and low level ways. In a high level sense, it is used synonymously with testing in the large. V&V can refer to a range of activities that include testing in the small and software quality assurance. More specifically, V&V can be used as an umbrella term for activities such as formal technical reviews, quality and configuration audits, performance monitoring, simulation, feasibility study, documentation review, database review, algorithm analysis, development testing, qualification testing, installation testing (Wallace & Fujii 1989; Pressman 2001). This is consistent with the ANSI definition of verification as the process of determining whether or not an object in a given phase of the software development process satisfies the requirements of previous phases ((ANSI/IEEE 1983b), as cited in (Beizer 1990)). In this view, V&V activities can take place during the entire life-cycle, at each stage of the development process, starting with requirements reviews, continuing through design reviews and code inspection, and finally product testing (Sommerville 2001). In this sense, software testing in the small is one activity of the V&V process. Similarly, the National Institute of Standards and Technology (NIST, formerly National Bureau of Standards) defines the high level view of VV&T as the procedure of review, analysis, and testing throughout the software life cycle to discover errors, determine functionality, and ensure the production of quality software (NBS 1981).
VERIFICATION, VALIDATION 625
Verification & Validation
In a low level sense, each of the terms verification and validation has a very specific meaning and refers to various activities that are carried out during software development. In an early definition, verification was characterized as determining if we "are building the product right" (Boehm 1981). In more current characterizations, the verification process ensures that the software correctly implements specific functions (Pressman 2001), characteristics of good design are incorporated, and the system operates the way the designers intended (Pfleeger 1998).

Note the emphasis in these definitions on aspects of specification and design. The definition of verification used by the National Bureau of Standards (NBS) also focuses on aspects that are internal to the system itself. They define verification as the demonstration of consistency, completeness, and correctness of the software at each stage and between each stage of the development life cycle (NBS 1981).

Validation, on the other hand, was originally characterized as determining if we "are building the right product" (Boehm 1981). This has been taken to have various meanings related to the customer or ultimate end-user of the system. For example, in one definition validation is seen as ensuring that the software, as built, is traceable to customer requirements (Pressman 2001) (as contrasted with the designer requirements specifications used in verification). Another definition more vaguely requires that the system meets the expectations of the customer buying it and is suitable for its intended purpose (Sommerville 2001). Pfleeger adds the notion (Pfleeger 1998) that the system implements all of the requirements, creating a two way relationship between requirements and system code (all code is traceable to requirements and all requirements are implemented). Pfleeger further distinguishes requirements validation, which makes sure that the requirements actually meet the customers' needs. These various definitions generally comply with the ANSI standard definition (ANSI/IEEE 1983a) of validation (as cited in (Beizer 1990)) as the process of evaluating software at the end of the development process to ensure compliance with requirements. The National Bureau of Standards definition agrees in large part with these user-centered definitions of validation, saying that it is the determination of the correctness of the final program or software with respect to the user needs and requirements.

As other terms within software engineering are more carefully defined, there is a subsequent impact on definitions of V&V. For example, the "requirements phase" often refers to the entire process of determining both user requirements and additional requirements that are necessary for actual system development. However, in new texts on software development (for example (Hamlet & Maybee 2001)) this process is broken into two phases: the requirements phase is strictly user centered, while the specification phase adds the additional requirements information that is needed by developers. This leads to confusing definitions of V&V which necessitate that first the terms "requirements" and "specifications" be well defined. In (Hamlet & Maybee 2001) the issue is addressed directly by defining verification as "checking that two independent representations of the same thing are consistent in describing it." They propose comparing the requirements document and the specification document for consistency, then the specification document and the design document, continuing through all the phases of software development.

Testing
We next return to various attempts in the literature to define testing. Most software engineering texts do not give an actual definition of testing and do not distinguish between testing in the large and testing in the small. Rather, they simply launch into lengthy discussion of what activities fall under the rubric of testing. For example, Pfleeger (Pfleeger 1998) states that the different phases of testing lead to a validated and verified system. The closest we get to an actual definition of testing (Pressman 2001) is that it is an "ultimate review of specification, design, and code generation". Generally, discussions of testing divide it into several phases, such as the following (Pressman 2001):

• unit testing, to verify that components work properly with expected types of input

• integration testing, to verify that system components work together as indicated in system specifications

• validation testing, to validate that software conforms to the requirements and functions in the way the end user expects it to (also referred to as function test and performance test (Pfleeger 1998)).

• system testing, in which software and other system elements are tested as a complete entity in order to verify that the desired overall function and performance of the system is achieved (also called acceptance testing (Pfleeger 1998)).

Rather than actually define testing, Sommerville (Sommerville 2001) presents two techniques within the V&V process. The first is software inspections which
are static processes for checking requirements documents, design diagrams, and program source code. The second is what we consider testing in the small, which involves executing code with test data and looking at output and operational behavior.

Pfleeger breaks down the testing process slightly differently, using three phases (Pfleeger 1998):

• testing programs,
• testing systems,
• evaluating products and processes.

The first two of these phases are equivalent to Pressman's four phases listed above. However, Pfleeger's third phase introduces a new concept, that of evaluation. In the context of software engineering and software testing, evaluation is designed to determine if goals have been met for productivity of the development group, performance of the system, and software quality. In addition, the evaluation process determines if the project under review has aspects that are of sufficient quality that they can be reused in future projects. The overall purpose of evaluation is to improve the software development process so that future development efforts will run more smoothly, cost less, and lead to greater return on investment for the entity funding the software project.

Peters and Pedrycz (Peters & Pedrycz 2000) present one of the vaguer sets of definitions. They define validation as occurring "whenever a system component is evaluated to ensure that it satisfies system requirements". They then define verification as "checking whether the product of a particular phase satisfies the conditions imposed at the beginning of that phase". There is no discussion of the source of the requirements and the source of the conditions, so it is unclear which step involves comparison to the design and which involves comparison to the customer's needs. Their discussion of testing provides no clarification as they simply state that testing determines when a software system can be released and gauges future performance.

This brief discussion indicates that there is a fair amount of agreement, within the software engineering community, on what is meant by verification and validation. Verification refers, overwhelmingly, to checking and establishing the relationship between the system and its specification (created during the design process), while validation refers to the relationship between the system's functionality and the needs and expectations of the end user. However, there are some authors whose use of the terms is not consistent with this usage. In addition, all of the key terms (testing, verification, validation, evaluation, specification, requirements) are overloaded. Every effort must be made in each usage to provide sufficient context and indicate whether a high-level or low-level usage is intended.

V&V of Intelligent Systems
The quagmire of terminology continues when we focus on the development of intelligent systems. As discussed in (Gonzalez & Barr 2000), a similarly varied set of definitions exists. Many of the definitions are derived from Boehm's original definitions (Boehm 1981) of verification and validation, although conflicting definitions do exist. It is also the case that, in this area, the software built is significantly different from the kinds of software dealt with in conventional software development models. Intelligent systems development deals with more than just the issues of specifications and user needs and expectations.

The chief distinction between "conventional" software and intelligent systems is that construction of an intelligent system is based on our (human) interpretation or model of the problem domain. The systems built are expected to behave in a fashion that is equivalent to the behavior of an expert in the field. Gonzalez and Barr argue, therefore, that it follows that human performance should be used as the benchmark for performance of an intelligent system. Given this distinction, and taking into account the definitions of other V&V researchers within the intelligent systems area, they propose definitions of verification and validation of intelligent systems as follows:

• Verification is the process of ensuring 1) that the intelligent system conforms to specifications, and 2) its knowledge base is consistent and complete within itself.

• Validation is the process of ensuring that the output of the intelligent system is equivalent to those of human experts when given the same inputs.

The proposed definition of verification essentially retains the standard definition used in software engineering, but adds to it the requirement that the knowledge base be consistent and complete (that is, free of internal errors). The proposed definition of validation is consistent with the standard definition if we consider human performance as the standard for the "customer requirements" or user expectations that must be satisfied by the system's performance.

Therefore, we can apply the usual definitions of V&V to intelligent systems with slight modifications to take into account the presence of a knowledge base and the necessity of comparing system performance to that of humans in the problem domain.
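As a rough illustration of the Gonzalez and Barr definitions of verification and validation of intelligent systems, the following Python sketch separates the two checks. Everything in it is an assumption made for illustration and is not taken from the cited work: the rule format, the "not X" negation convention, the names find_conflicts, agreement_rate and toy_system, and the toy data. The verification check covers only one narrow kind of knowledge base inconsistency (identical conditions with contradictory conclusions) and says nothing about completeness. The validation check measures exact agreement with expert answers on the same inputs.

```python
from typing import Callable, Dict, FrozenSet, List, Tuple

# Hypothetical rule format (an assumption for this sketch):
# (set of condition facts, conclusion); "not X" negates "X".
Rule = Tuple[FrozenSet[str], str]


def negate(conclusion: str) -> str:
    """Return the negation of a conclusion under the 'not X' convention."""
    return conclusion[4:] if conclusion.startswith("not ") else "not " + conclusion


def find_conflicts(rules: List[Rule]) -> List[Tuple[Rule, Rule]]:
    """Verification-style check: pairs of rules with the same conditions
    but contradictory conclusions (one narrow kind of inconsistency)."""
    conflicts = []
    for i, (cond_a, concl_a) in enumerate(rules):
        for cond_b, concl_b in rules[i + 1:]:
            if cond_a == cond_b and concl_a == negate(concl_b):
                conflicts.append(((cond_a, concl_a), (cond_b, concl_b)))
    return conflicts


def agreement_rate(system: Callable[[str], str],
                   expert_answers: Dict[str, str]) -> float:
    """Validation-style check: fraction of inputs on which the system's
    output equals the answer a human expert gave for the same input."""
    if not expert_answers:
        raise ValueError("need at least one expert-labelled case")
    matches = sum(1 for case, answer in expert_answers.items()
                  if system(case) == answer)
    return matches / len(expert_answers)


# Toy data, invented purely for illustration.
rules = [
    (frozenset({"fever", "cough"}), "flu"),
    (frozenset({"fever", "cough"}), "not flu"),
    (frozenset({"rash"}), "allergy"),
]
expert_answers = {"fever cough": "flu", "rash": "allergy", "fever rash": "allergy"}


def toy_system(case: str) -> str:
    """Stand-in for an intelligent system's inference step."""
    return "flu" if "fever" in case else "allergy"


print(len(find_conflicts(rules)))                  # number of conflicting rule pairs
print(agreement_rate(toy_system, expert_answers))  # share of cases matching the expert
```

The two functions are kept separate on purpose: the first looks only at the knowledge base itself, while the second treats the system as a black box and compares its behavior with human performance, which mirrors the internal versus user-centered split discussed for conventional software.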
Applying V&V in a New Area
As shown, the area of VV&T is based on overloaded terminology, with generally accepted definitions as well as conflicting definitions throughout the literature, both in the software engineering field and in the intelligent systems V&V community. The questions then arise, how should we proceed and what difficulties might be encountered in an attempt to apply VV&T efforts in a new problem domain? In this section we discuss the difficulties that arose, and the specific terminology issues, in a shift into the area of natural language processing (NLP) systems.

Language, as a research area, is studied in many contexts. Of interest to us is the work that takes place at the intersection of linguistics and computer science. The overall goal (Allen 1995) is to develop a computational theory of language, tackling areas such as speech recognition, natural language understanding, natural language generation, speech synthesis, information retrieval, information extraction, and inference (Jurafsky & Martin 2000).

We subdivide language processing activities into two categories, those in which text and components of text are analyzed, and those in which the analysis mechanisms are applied to solve higher level problems. For example, text analysis methods include morphology, part of speech tagging, phrase chunking, parsing, semantic analysis, and discourse analysis. These analysis methods are in turn used in application areas such as machine translation, information extraction, question and answer systems, automatic indexing, text summarization, and text generation.

Many NLP systems have been built to date, both for research purposes and for actual use in application domains. However, the literature indicates (Sundheim 1989; Jones & Galliers 1996; Hirschman & Thompson 1998) that these systems are typically subjected to an evaluation process using a test suite that is built to maximize domain coverage. This immediately raises the questions of what is meant by the term evaluation as it is used in the NLP community, whether it is equivalent to testing in the small or to testing in the large, and where it fits in the VV&T terminology quagmire.

NLP systems have largely been evaluated using a black-box, functional approach, often supplemented with an analysis of how acceptable the output is to users (Hirschman & Thompson 1998; White & Taylor 1998). The evaluation process must determine whether the system serves the intended function in the intended environment. There are several evaluation taxonomies (Cole et al. 1998; Jones & Galliers 1996), but the common goals are to determine if the system meets objectives, identify areas in which the system does not perform as well as predicted or desired, and compare different approaches for solving a single problem.

What becomes apparent is that there are several key differences between testing and evaluation. One obvious difference between testing and evaluation is that evaluation takes place late in the development life cycle, after a system is largely complete. On the other hand, many aspects of testing (such as requirements analysis and inspection, unit testing and integration testing) are undertaken early in the life cycle. A second difference is that evaluation data is based on domain coverage, whereas some of the data used in systematic software testing is based on code coverage.

The perspective from which a system is either tested or evaluated is also very important in this comparison. In systematic software testing a portion of testing involves actual code coverage, which is determined based on the implementation paradigm. For example, there are testing methods for systems written in procedural languages such as C, in object oriented languages such as C++ and Java, and developed using UML. However, NLP systems are evaluated based on the application domain. For example, a speech interface will be evaluated with regard to accuracy, coverage, and speed (James, Rayner, & Hockey 2000) regardless of its implementation language.

Finally, we contrast the respective goals of testing and evaluation. As stated above, the goal of program level testing is to ultimately identify and correct faults in the system. The goal of evaluation of an NLP system is to determine how well the system works, and determine what will happen and how the system will perform when it is removed from the development environment and put into use in the setting for which it is intended. Evaluation is user-oriented, with a focus on domain coverage. Given its focus on the user, evaluation is most like the validation aspect of VV&T.

As part of evaluation work, organized (competitive) comparisons are carried out of multiple systems which perform the same task. For example, the series of Message Understanding Conferences (MUC) involved the evaluation of information extraction systems. Similarly the Text Retrieval Conferences (TREC) carry out large-scale evaluation of text retrieval systems. These efforts allow for comparison of different approaches to particular language processing problems.

Functional, black-box evaluation is a very important and powerful analysis method, particularly because it works from the perspective of the user, without concern for implementation. However, a more complete methodology would also take into account implementation details and conventional program based testing. Without this we can not be sure that the