From: FLAIRS-01 Proceedings. Copyright © 2001, AAAI (www.aaai.org). All rights reserved.

A Quagmire of Terminology: Verification & Validation, Testing, and Evaluation

Valerie Barr
Department of Computer Science
Hofstra University
Hempstead, NY 11550
vbarr@hofstra.edu

Abstract

Software engineering literature presents multiple definitions for the terms verification, validation and testing. The ensuing difficulties carry into research on the verification and validation (V&V) of intelligent systems. We explore both these areas and then address the additional terminology problems faced when attempting to carry out V&V work in a new domain such as natural language processing (NLP).

Introduction

Historically, verification and validation (V&V) researchers have labored under multiple definitions of key terms within the field. In addition, the terminology used by V&V researchers working with intelligent systems can differ from that used by software engineers and software testing researchers. As a result, many V&V research efforts must begin with a (re)definition of the terms that will be used. The need to establish working definitions becomes more pressing if we try to apply verification, validation, and testing (VV&T) theory and practice to fields in which developers do not normally carry out formal VV&T activities. This paper starts with a review of terminology that is used in the software engineering/software testing areas. It then discusses the terminology issues that exist among V&V researchers in the intelligent systems community and between them and the software engineering/software testing communities. Finally, it explores the terminology issues that can arise when we attempt to apply VV&T to other domain areas, such as natural language processing systems.

Terminology Conflicts - First View

The first term to tackle in the terminology of software testing is the term testing itself. Unfortunately this word is used to refer to several activities that take place at very different levels in the software development process. In one usage, the term refers to testing in the small: the exercise of program code with test cases, with a goal of uncovering faults in code by exposing failures. In another usage, the term refers to testing in the large: the entire overall process of verification, validation, and quality analysis and assurance.

The term V&V, for verification and validation, is also used in both high level and low level ways. In a high level sense, it is used synonymously with testing in the large. V&V can refer to a range of activities that include testing in the small and software quality assurance. More specifically, V&V can be used as an umbrella term for activities such as formal technical reviews, quality and configuration audits, performance monitoring, simulation, feasibility study, documentation review, database review, algorithm analysis, development testing, qualification testing, and installation testing (Wallace & Fujii 1989; Pressman 2001). This is consistent with the ANSI definition of verification as the process of determining whether or not an object in a given phase of the software development process satisfies the requirements of previous phases ((ANSI/IEEE 1983b), as cited in (Beizer 1990)). In this view, V&V activities can take place during the entire life-cycle, at each stage of the development process, starting with requirements reviews, continuing through design reviews and code inspection, and finally product testing (Sommerville 2001). In this sense, software testing in the small is one activity of the V&V process.
Similarly, the National Institute of Standards and Technology (NIST, formerly the National Bureau of Standards) defines the high level view of VV&T as the procedure of review, analysis, and testing throughout the software life cycle to discover errors, determine functionality, and ensure the production of quality software (NBS 1981).

Verification & Validation

In a low level sense, each of the terms verification and validation has a very specific meaning and refers to various activities that are carried out during software development. In an early definition, verification was characterized as determining if we "are building the product right" (Boehm 1981). In more current characterizations, the verification process ensures that the software correctly implements specific functions (Pressman 2001), characteristics of good design are incorporated, and the system operates the way the designers intended (Pfleeger 1998).

Note the emphasis in these definitions on aspects of specification and design. The definition of verification used by the National Bureau of Standards (NBS) also focuses on aspects that are internal to the system itself. They define verification as the demonstration of consistency, completeness, and correctness of the software at each stage and between each stage of the development life cycle (NBS 1981).

Validation, on the other hand, was originally characterized as determining if we "are building the right product" (Boehm 1981). This has been taken to have various meanings related to the customer or ultimate end-user of the system. For example, in one definition validation is seen as ensuring that the software, as built, is traceable to customer requirements (Pressman 2001) (as contrasted with the designer requirements specifications used in verification). Another definition more vaguely requires that the system meet the expectations of the customer buying it and be suitable for its intended purpose (Sommerville 2001). Pfleeger adds the notion (Pfleeger 1998) that the system implements all of the requirements, creating a two-way relationship between requirements and system code (all code is traceable to requirements and all requirements are implemented). Pfleeger further distinguishes requirements validation, which makes sure that the requirements actually meet the customers' needs. These various definitions generally comply with the ANSI standard definition (ANSI/IEEE 1983a) of validation (as cited in (Beizer 1990)) as the process of evaluating software at the end of the development process to ensure compliance with requirements. The National Bureau of Standards definition agrees in large part with these user-centered definitions of validation, saying that it is the determination of the correctness of the final program or software with respect to the user needs and requirements.

As other terms within software engineering are more carefully defined, there is a subsequent impact on definitions of V&V. For example, the "requirements phase" often refers to the entire process of determining both user requirements and additional requirements that are necessary for actual system development. However, in new texts on software development (for example (Hamlet & Maybee 2001)) this process is broken into two phases: the requirements phase is strictly user centered, while the specification phase adds the additional requirements information that is needed by developers. This leads to confusing definitions of V&V which necessitate that first the terms "requirements" and "specifications" be well defined. In (Hamlet & Maybee 2001) the issue is addressed directly by defining verification as "checking that two independent representations of the same thing are consistent in describing it." They propose comparing the requirements document and the specification document for consistency, then the specification document and the design document, continuing through all the phases of software development.
Testing

We next return to various attempts in the literature to define testing. Most software engineering texts do not give an actual definition of testing and do not distinguish between testing in the large and testing in the small. Rather, they simply launch into lengthy discussion of what activities fall under the rubric of testing. For example, Pfleeger (Pfleeger 1998) states that the different phases of testing lead to a validated and verified system. The closest we get to an actual definition of testing (Pressman 2001) is that it is an "ultimate review of specification, design, and code generation". Generally, discussions of testing divide it into several phases, such as the following (Pressman 2001):

• unit testing, to verify that components work properly with expected types of input

• integration testing, to verify that system components work together as indicated in system specifications

• validation testing, to validate that software conforms to the requirements and functions in the way the end user expects it to (also referred to as function test and performance test (Pfleeger 1998))

• system testing, in which software and other system elements are tested as a complete entity in order to verify that the desired overall function and performance of the system is achieved (also called acceptance testing (Pfleeger 1998))

Rather than actually define testing, Sommerville (Sommerville 2001) presents two techniques within the V&V process. The first is software inspections, which are static processes for checking requirements documents, design diagrams, and program source code. The second is what we consider testing in the small, which involves executing code with test data and looking at output and operational behavior.
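To make the unit testing phase concrete, the following minimal sketch (our illustration; the word_count function and its test cases are hypothetical, not drawn from any of the cited texts) exercises a single component with expected types of input, in the testing in the small sense:

    import unittest

    def word_count(text):
        """Hypothetical unit under test: count whitespace-separated words."""
        return len(text.split())

    class WordCountTest(unittest.TestCase):
        def test_typical_input(self):
            # Exercise the component with an expected type of input.
            self.assertEqual(word_count("the quick brown fox"), 4)

        def test_boundary_input(self):
            # Boundary case chosen to expose a failure if the unit
            # mishandles empty input.
            self.assertEqual(word_count(""), 0)

    if __name__ == "__main__":
        unittest.main()

A failure of either assertion would expose a fault in the unit under test; the later phases in the list above repeat this kind of exercise at progressively larger granularities, against specifications and user expectations rather than component-level expectations.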
Pfleeger breaks down the testing process slightly differently, using three phases (Pfleeger 1998):

• testing programs,

• testing systems,

• evaluating products and processes.

The first two of these phases are equivalent to Pressman's four phases listed above. However, Pfleeger's third phase introduces a new concept, that of evaluation. In the context of software engineering and software testing, evaluation is designed to determine if goals have been met for productivity of the development group, performance of the system, and software quality. In addition, the evaluation process determines if the project under review has aspects that are of sufficient quality that they can be reused in future projects. The overall purpose of evaluation is to improve the software development process so that future development efforts will run more smoothly, cost less, and lead to greater return on investment for the entity funding the software project.

Peters and Pedrycz (Peters & Pedrycz 2000) present one of the vaguer sets of definitions. They define validation as occurring "whenever a system component is evaluated to ensure that it satisfies system requirements". They then define verification as "checking whether the product of a particular phase satisfies the conditions imposed at the beginning of that phase". There is no discussion of the source of the requirements and the source of the conditions, so it is unclear which step involves comparison to the design and which involves comparison to the customer's needs. Their discussion of testing provides no clarification, as they simply state that testing determines when a software system can be released and gauges future performance.

This brief discussion indicates that there is a fair amount of agreement, within the software engineering community, on what is meant by verification and validation. Verification refers, overwhelmingly, to checking and establishing the relationship between the system and its specification (created during the design process), while validation refers to the relationship between the system's functionality and the needs and expectations of the end user. However, there are some authors whose use of the terms is not consistent with this usage. In addition, all of the key terms (testing, verification, validation, evaluation, specification, requirements) are overloaded. Every effort must be made in each usage to provide sufficient context and indicate whether a high-level or low-level usage is intended.

V&V of Intelligent Systems

The quagmire of terminology continues when we focus on the development of intelligent systems. As discussed in (Gonzalez & Barr 2000), a similarly varied set of definitions exists. Many of the definitions are derived from Boehm's original definitions (Boehm 1981) of verification and validation, although conflicting definitions do exist. It is also the case that, in this area, the software built is significantly different from the kinds of software dealt with in conventional software development models. Intelligent systems development deals with more than just the issues of specifications and user needs and expectations.

The chief distinction between "conventional" software and intelligent systems is that construction of an intelligent system is based on our (human) interpretation or model of the problem domain. The systems built are expected to behave in a fashion that is equivalent to the behavior of an expert in the field. Gonzalez and Barr argue, therefore, that it follows that human performance should be used as the benchmark for performance of an intelligent system. Given this distinction, and taking into account the definitions of other V&V researchers within the intelligent systems area, they propose definitions of verification and validation of intelligent systems as follows:

• Verification is the process of ensuring 1) that the intelligent system conforms to specifications, and 2) that its knowledge base is consistent and complete within itself.

• Validation is the process of ensuring that the outputs of the intelligent system are equivalent to those of human experts when given the same inputs.

The proposed definition of verification essentially retains the standard definition used in software engineering, but adds to it the requirement that the knowledge base be consistent and complete (that is, free of internal errors). The proposed definition of validation is consistent with the standard definition if we consider human performance as the standard for the "customer requirements" or user expectations that must be satisfied by the system's performance. Therefore, we can apply the usual definitions of V&V to intelligent systems with slight modifications to take into account the presence of a knowledge base and the necessity of comparing system performance to that of humans in the problem domain.
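To illustrate the knowledge base half of the proposed verification definition, consider the following minimal sketch (entirely our own hypothetical example; the rule format and both checks are assumptions, far simpler than the techniques surveyed in the intelligent systems V&V literature). It flags one kind of inconsistency, rules with identical antecedents but contradictory conclusions, and one simplified kind of incompleteness, rules that can never fire because their antecedents are neither observable facts nor conclusions of other rules:

    # A toy rule base: each rule is (set of antecedents, conclusion).
    RULES = [
        ({"fever", "rash"}, "measles"),
        ({"fever", "rash"}, "not-measles"),   # conflicts with the rule above
        ({"xray-shadow"}, "pneumonia"),       # antecedent is never derivable
    ]
    KNOWN_FACTS = {"fever", "rash"}           # inputs the system can observe
    CONFLICTS = {("measles", "not-measles")}  # domain-declared contradictions

    def check_consistency(rules, conflicts):
        """Flag rule pairs with identical antecedents but conflicting conclusions."""
        problems = []
        for i, (ants_a, concl_a) in enumerate(rules):
            for ants_b, concl_b in rules[i + 1:]:
                contradictory = ((concl_a, concl_b) in conflicts
                                 or (concl_b, concl_a) in conflicts)
                if ants_a == ants_b and contradictory:
                    problems.append((concl_a, concl_b))
        return problems

    def check_reachability(rules, facts):
        """Flag rules whose antecedents are neither observable facts nor
        conclusions of other rules, so the rule can never fire."""
        derivable = facts | {concl for _, concl in rules}
        return [concl for ants, concl in rules if not ants <= derivable]

    print("conflicting rule pairs:", check_consistency(RULES, CONFLICTS))
    print("unreachable rules:", check_reachability(RULES, KNOWN_FACTS))

Checks of this kind examine only the knowledge base itself, which is what distinguishes this half of verification from validation against human performance.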
Applying V&V in a New Area

As shown, the area of VV&T is based on overloaded terminology, with generally accepted definitions as well as conflicting definitions throughout the literature, both in the software engineering field and in the intelligent systems V&V community. The questions then arise: how should we proceed, and what difficulties might be encountered in an attempt to apply VV&T efforts in a new problem domain? In this section we discuss the difficulties that arose, and the specific terminology issues, in a shift into the area of natural language processing (NLP) systems.

Language, as a research area, is studied in many contexts. Of interest to us is the work that takes place at the intersection of linguistics and computer science. The overall goal (Allen 1995) is to develop a computational theory of language, tackling areas such as speech recognition, natural language understanding, natural language generation, speech synthesis, information retrieval, information extraction, and inference (Jurafsky & Martin 2000).

We subdivide language processing activities into two categories: those in which text and components of text are analyzed, and those in which the analysis mechanisms are applied to solve higher level problems. For example, text analysis methods include morphology, part of speech tagging, phrase chunking, parsing, semantic analysis, and discourse analysis. These analysis methods are in turn used in application areas such as machine translation, information extraction, question and answer systems, automatic indexing, text summarization, and text generation.

Many NLP systems have been built to date, both for research purposes and for actual use in application domains. However, the literature indicates (Sundheim 1989; Jones & Galliers 1996; Hirschman & Thompson 1998) that these systems are typically subjected to an evaluation process using a test suite that is built to maximize domain coverage. This immediately raises the questions of what is meant by the term evaluation as it is used in the NLP community, whether it is equivalent to testing in the small or to testing in the large, and where it fits in the VV&T terminology quagmire.

NLP systems have largely been evaluated using a black-box, functional, approach, often supplemented with an analysis of how acceptable the output is to users (Hirschman & Thompson 1998; White & Taylor 1998). The evaluation process must determine whether the system serves the intended function in the intended environment. There are several evaluation taxonomies (Cole et al. 1998; Jones & Galliers 1996), but the common goals are to determine if the system meets objectives, identify areas in which the system does not perform as well as predicted or desired, and compare different approaches for solving a single problem.

What becomes apparent is that there are several key differences between testing and evaluation. One obvious difference is that evaluation takes place late in the development life cycle, after a system is largely complete. On the other hand, many aspects of testing (such as requirements analysis and inspection, unit testing and integration testing) are undertaken early in the life cycle. A second difference is that evaluation data is based on domain coverage, whereas some of the data used in systematic software testing is based on code coverage.

The perspective from which a system is either tested or evaluated is also very important in this comparison. In systematic software testing a portion of testing involves actual code coverage, which is determined based on the implementation paradigm. For example, there are testing methods for systems written in procedural languages such as C, in object oriented languages such as C++ and Java, and developed using UML. However, NLP systems are evaluated based on the application domain. For example, a speech interface will be evaluated with regard to accuracy, coverage, and speed (James, Rayner, & Hockey 2000) regardless of its implementation language.
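As a sketch of what such domain-based evaluation can look like in code (our own hypothetical harness; the toy system, the test suite, and the category names are stand-ins, not taken from any of the cited evaluations), the following treats the system under evaluation as an opaque function and reports the three measures just mentioned: accuracy, domain coverage, and speed:

    import time

    def evaluate(system, test_suite, domain_categories):
        """Black-box evaluation: compare outputs to references; ignore internals."""
        correct, covered = 0, set()
        start = time.perf_counter()
        for category, text, expected in test_suite:
            covered.add(category)
            if system(text) == expected:  # functional check on output only
                correct += 1
        elapsed = time.perf_counter() - start
        return {
            "accuracy": correct / len(test_suite),
            # Domain coverage, not code coverage: which phenomena the suite hit.
            "coverage": len(covered) / len(domain_categories),
            "seconds_per_item": elapsed / len(test_suite),
        }

    # Stand-in system: a trivial classifier for polite spoken requests.
    def toy_system(utterance):
        return "request" if "please" in utterance else "command"

    suite = [
        ("polite-request", "open the file please", "request"),
        ("bare-command", "open the file", "command"),
        ("negation", "do not open it", "command"),
    ]
    print(evaluate(toy_system, suite,
                   {"polite-request", "bare-command", "negation", "ellipsis"}))

Nothing in the harness depends on the implementation language or paradigm of the system being evaluated, which is precisely the property that separates evaluation of this kind from code-coverage-based testing.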
Finally, we contrast the respective goals of testing and evaluation. As stated above, the goal of program level testing is ultimately to identify and correct faults in the system. The goal of evaluation of an NLP system is to determine how well the system works, and to determine what will happen and how the system will perform when it is removed from the development environment and put into use in the setting for which it is intended. Evaluation is user-oriented, with a focus on domain coverage. Given its focus on the user, evaluation is most like the validation aspect of VV&T.

As part of evaluation work, organized (competitive) comparisons are carried out of multiple systems which perform the same task. For example, the series of Message Understanding Conferences (MUC) involved the evaluation of information extraction systems. Similarly, the Text Retrieval Conferences (TREC) carry out large-scale evaluation of text retrieval systems. These efforts allow for comparison of different approaches to particular language processing problems.

Functional, black-box, evaluation is a very important and powerful analysis method, particularly because it works from the perspective of the user, without concern for implementation. However, a more complete methodology would also take into account implementation details and conventional program based testing. Without this we can not be sure that the