HiQIQ - High Quality Information Querying

Information Quality

Defining information quality (IQ) is extremely difficult due to the subjective nature of quality. Definitions such as "fitness for use" are general enough to be correct, but they are unusable when IQ is to be evaluated.

Information quality (IQ) has many facets. To capture these dimensions, quality is often described as a set of criteria, for instance by Wang and Strong (1996). What follows is a long list of IQ criteria as they have surfaced during research. Not all criteria must be used in all systems, and new criteria can easily be added to this list.


Content-related Criteria

Accuracy 
is the quotient of the number of correct values in the source and the overall number of values in the source. For our context this is the percentage of data without data errors such as non-unique keys or out-of-range values.

 
Increasing accuracy is a main goal of many research efforts. Accuracy is often used synonymously with data quality, as opposed to information quality. For us, data quality or accuracy is only one aspect of the overall information quality, which includes the entire set of criteria in this list. 
Considering the accuracy criterion in a WWW information system setting has the same importance as for traditional databases. Accuracy is one of the main intrinsic properties of information. Incorrect information is hard to detect, useless, and in many cases even harmful. 
Synonyms: data quality, error rate, correctness, reliability, integrity, precision 
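The accuracy quotient can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name accuracy, the record layout (a list of dictionaries), and the valid_ranges parameter are hypothetical, and only the two error types named above (non-unique keys and out-of-range values) are checked.

```python
from collections import Counter


def accuracy(rows, key, valid_ranges):
    """Fraction of values free of the two error types named in the text.

    rows         -- list of dicts, one per tuple (assumed layout)
    key          -- name of the attribute that should be unique
    valid_ranges -- dict mapping attribute name to an inclusive (low, high) pair
    """
    key_counts = Counter(row[key] for row in rows)
    total = correct = 0
    for row in rows:
        for attr, value in row.items():
            total += 1
            if attr == key:
                ok = key_counts[value] == 1          # key must be unique
            elif attr in valid_ranges:
                low, high = valid_ranges[attr]
                ok = value is not None and low <= value <= high
            else:
                ok = True  # no rule known, so the value counts as correct
            correct += ok
    # An empty source has no errors; returning 1.0 here is a convention.
    return correct / total if total else 1.0


quotes = [
    {"id": 1, "price": 101.5},
    {"id": 2, "price": -4.0},   # price is out of range
    {"id": 2, "price": 55.0},   # key 2 is not unique
]
score = accuracy(quotes, key="id", valid_ranges={"price": (0.0, 10_000.0)})
print(score)
```

In this toy source three of the six values are flagged (both occurrences of the duplicated key and the negative price), so the quotient is one half. Errors that no rule can detect are counted as correct, which is why such a score is best read as an upper bound.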
Completeness
is the quotient of the number of non-null values in a source and the number of non-null values in the combination of all available sources. Applied to the relational schema of Chapter the number of non-null values in the combination of all available sources corresponds to the size of the universal relation: the number of attributes multiplied by the number of tuples if all available sources were queried. The number of non-null values in a source is then the number of values a source can insert into the schema of the universal relation. 
We define completeness more formally in Chapter . There we analyze this criterion in great depth and apply it to several application domains. In Chapter  we perform optimization to maximize completeness.
Completeness is of great importance in information systems that integrate multiple information sources. One of the main goals for integration is to increase completeness: Querying only one source typically gives only one part of the result. Querying another source will provide another, possibly overlapping part. The more sources we query, the more complete the result will be. 
Synonyms: coverage, scope, granularity, comprehensiveness, density, extent 
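As a minimal sketch of the completeness quotient, the following Python function divides the non-null values of one source by the size of the universal relation (attributes times tuples over all sources). The function name, the record layout, and the assumption that tuples from different sources are matched by a shared key with at most one row per key are illustrative choices, not part of the formal definition referred to above.

```python
def completeness(source_rows, all_sources_rows, key, attributes):
    """Non-null values of one source relative to the universal relation.

    source_rows      -- rows (dicts) of the source being scored
    all_sources_rows -- list of row lists, one per available source
    key              -- attribute assumed to identify a tuple across sources
    attributes       -- attributes of the universal relation
    """
    # Tuples of the universal relation: every key seen in any source.
    universe = {row[key] for rows in all_sources_rows for row in rows}
    universal_size = len(universe) * len(attributes)
    non_null = sum(
        1
        for row in source_rows
        for attr in attributes
        if row.get(attr) is not None
    )
    return non_null / universal_size if universal_size else 0.0


attributes = ["symbol", "price", "volume"]
source_a = [
    {"symbol": "IBM", "price": 100.0, "volume": None},
    {"symbol": "SAP", "price": 50.0, "volume": 1000},
]
source_b = [
    {"symbol": "IBM", "price": 100.0, "volume": 500},
    {"symbol": "HP", "price": None, "volume": None},
]
score_a = completeness(source_a, [source_a, source_b], "symbol", attributes)
score_b = completeness(source_b, [source_a, source_b], "symbol", attributes)
print(score_a, score_b)
```

Here the universal relation has three tuples and three attributes, so its size is nine; source_a contributes five non-null values and source_b four. Neither source is complete on its own, which illustrates why querying additional sources increases completeness.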
Customer support 
is the amount and usefulness of human help via email or phone. This criterion is closely related to the documentation criterion below. It is one part of an overall help system to guide users in understanding and using information. Depending on the type of support, one part of the measure could be the average waiting time for a response. Another part, more difficult to assess, is how useful the help is.
For a discussion on the importance of this criterion, see the documentation criterion below, where the same arguments apply. 
Documentation
is the amount and usefulness of documents with meta information. For WWW information systems documentation usually is in the form of "help"-links that lead to Webpages that explain the provided data. As a simple measure we count the number of words in the documentation. Issues of usefulness and understandability are already covered by other criteria. We extend the scope of those scores to the documentation part of the source. 
The importance of the documentation criterion depends on the application: Often the presentation of information is self-describing and it is not necessary to measure how well a source documents its information. For instance, this is the case for search engines. On the other hand, there are domains where integration and use of the source is not possible without good documentation. Molecular biology information sources have great problems with synonyms, homonyms, and other types of heterogeneity. Without good documentation, query results are very prone to misunderstanding.
Synonyms: clarity of definition, traceability 
Interpretability
is the degree to which the information conforms to the technical ability of the consumer. Technical abilities include languages spoken, units understood, etc. A highly interpretable source must also provide clear and simple definitions of all elements of the information. In this sense interpretability is similar to documentation and understandability.
In integrated information systems, interpretability of a source is not as important as other criteria, because we assume that many of the issues are hidden by wrappers and the mediator. The wrappers of a source can already convert units to suit the user, text can be automatically translated at least to a useful extent, etc. It is then up to the wrapper to present the integrated information in an interpretable way. In conclusion, an information source with high interpretability is easier to include in a mediated system, but the criterion plays a less important role once the source is successfully integrated.
Synonyms: clarity of definition, simplicity 
Relevancy (or relevance)
is the degree to which the provided information satisfies the user's need. Relevancy is an often-used criterion in the field of information retrieval. There, a document or piece of information is considered relevant to the query if the keywords of the query appear often and/or in prominent positions in the document.

 
The importance of the relevancy criterion depends on the application domain. For search engines relevancy is quite important: returned Webpage links should be as relevant as possible, even though this is difficult to achieve. For instance, a query for the term "jaguar" at any WWW search engine will retrieve document links both for the animal and the automobile. If the user had the animal in mind, the links to automobile sites should be considered not relevant. In other application domains, relevancy is implicitly high. For instance, a query for IBM stock quotes in an integrated stock information system will only return relevant results, namely IBM stock quotes. The reason for this discrepancy is the definition of the domain: Search engines have the entire WWW as a domain and thus provide much information that is of no interest to the user. The domain of a stock information system is much more clear-cut and much smaller, so a query is less likely to produce irrelevant results.

For our purposes we reduce the relevancy criterion to a correctness criterion. If a result is correct with respect to the user query, we assume it is also relevant. If it is actually not relevant, the user query was either incorrect with respect to what the user had in mind or it was not specific enough. Relevance feedback techniques were developed by Salton and McGill to make a query more specific and increase relevancy.

Synonyms: domain precision, minimum redundancy, applicability, helpfulness 

Value-Added
is a criterion that measures the amount of monetary benefit the use of the information provides. This criterion is typical for decision-support information systems where a cost-benefit calculation is undertaken. The value-added criterion must be considered when there is a cost involved in obtaining the information and when the nature of the information is not yet known.
Often value-added cannot be attributed to the source of the information but only to the information itself. A stock information system will provide stock quotes but cannot influence them and thus cannot increase "value-addedness"; a search engine has no influence on how useful its results are. For this reason this criterion is often not considered for WWW information systems. 


Technical Criteria

Availability
of an information source is the probability that a feasible query is correctly answered in a given time range. Availability is a technical measure concerning hardware and software of the source and the network connections between user, mediator, wrappers, and sources. Typically, availability is also time-dependent due to different usage patterns of the information source. 
Availability is an important criterion for WWW information sources for many reasons: network congestion that depends on the time of day and the day of the week; world-wide distribution of servers; high concurrent usage; denial-of-service attacks; planned maintenance interruptions. Query execution in integrated systems is especially vulnerable to low availability because usually all participating sources of a query execution plan must be available in order for the query to be executed correctly. In Section  we pay special attention to the availability criterion and propose an algorithm that dynamically adapts its optimization strategy in case of an unavailable source. 
Synonyms: accessibility, technical reliability, retrievability, performability 
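The availability probability can be estimated empirically, and the vulnerability of multi-source plans follows from simple arithmetic. The sketch below is an illustration only: the function names are hypothetical, the probe outcomes are made-up sample data, and the product rule for a plan assumes that sources fail independently of each other, which the text does not claim.

```python
from math import prod


def availability(probe_results):
    """Estimate availability as the fraction of probe queries that were
    answered correctly within the time limit (True) versus not (False)."""
    if not probe_results:
        raise ValueError("need at least one probe")
    return sum(probe_results) / len(probe_results)


def plan_availability(source_availabilities):
    """Probability that every source of a plan answers.

    Assumption: sources fail independently, so probabilities multiply.
    """
    return prod(source_availabilities)


probes = [True, True, False, True]         # hypothetical probe outcomes
single = availability(probes)              # one source
plan = plan_availability([single, 0.5])    # plan using two sources
print(single, plan)
```

Because every source of a plan must answer, the plan can never be more available than its weakest source, and each additional source lowers the estimate further. Since availability is time-dependent, probes would in practice be grouped by time of day before estimating.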
Latency
is the amount of time in seconds from issuing the query until the first information reaches the user. If the result of the query is only one piece of information, e.g., one stock quote, latency equals response time (see below). 
Latency is an important criterion in WWW information system settings for two reasons. The first is technical: Information is sent over the Internet using the Hypertext Transfer Protocol (HTTP), and the data arrives in packets of up to 64 kilobytes. If the entire response is larger, the first packet can be displayed before further packets arrive. Additionally, many sources withhold the entire result and only return the first part. For instance, search engines typically return only the first 10 links. If the user desires more results, another query must be posed by following a link. The second reason is that in many applications the user is actually only interested in the first part of the information or only in an overview of the result. Again, search engines are a good example. Often, the first 10 results are enough to satisfy the user, especially if the results are ranked well. For many other applications, not the actual result but the number of results is the only interest of the user. Consider a user querying a stock information system for companies whose stock has risen more than 50% during the last year. Most often, not the actual companies but their number is of interest.
Synonyms: Often response time and latency are used synonymously. 
Price
is the amount of money a user has to pay for a query as determined by the provider. Commercial data sources usually charge for their information either on a subscription basis or on a pay-per-query or pay-per-byte basis. Often there is a direct tradeoff between price and other IQ criteria. Free stock information services provide stock quotes with some delay (usually 15 minutes) while subscription systems provide the quotes in real time. There may also be a hidden cost in retrieving information: Users spend time online paying for the Internet connection, and users are exposed to advertisements.
Considering price is important if at least one integrated information source charges money for information. It is common opinion that the World Wide Web has prospered due to its free information services. Information sources earn money by displaying advertisements. Experts predict a change towards high quality information sources that charge money for their services.
Synonyms: query value-to-cost ratio, cost-effectivity 
Response time
measures the delay in seconds between submission of a query by the user and reception of the complete response from the information system. The score for this criterion depends on unpredictable factors such as network traffic, server workload, etc. Another factor is the type and complexity of the user query. Again, this cannot be predicted; however, it can be taken into account once the query is posed and a query execution plan is developed. Finally, the technical equipment of the information server plays a role as well. However, in WWW settings network delay usually dominates all other factors.
Response time is the main criterion for traditional database optimizers. While for WWW information systems it is just one aspect among many other IQ criteria, it is still of some significance. Because of frequent time-outs and unknown availability of sources, users waiting long for a response from a WWW information source are more prone to abort the query than database users. This cancellation can be prevented by low latency, which gives users at least some results early on. Another reason for the importance of low response time is the potential competition on the Web. With many alternative sites the users will quickly switch from one source to another to find the desired information. An integrated system such as a meta search engine avoids this effect but must also consider response time when deciding which sources to use to answer a query. 
Synonyms: performance, turnaround time 
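The difference between latency and response time can be made concrete by timing a result that arrives piece by piece. The following is a minimal sketch: the function name measure and its clock parameter are hypothetical, and the example replaces the real clock with fixed readings so that the numbers are reproducible.

```python
import time


def measure(result_stream, clock=time.perf_counter):
    """Return (latency, response_time, items) for an iterable of result pieces.

    latency       -- seconds until the first piece arrives
    response_time -- seconds until the stream is exhausted
    """
    start = clock()
    latency = None
    items = []
    for item in result_stream:
        if latency is None:
            latency = clock() - start   # first piece has arrived
        items.append(item)
    response_time = clock() - start     # complete response received
    if latency is None:
        latency = response_time         # empty result: nothing arrived earlier
    return latency, response_time, items


# Simulated clock readings in seconds: query issued, first piece, last piece.
ticks = iter([0.0, 0.2, 1.0])
latency, response_time, items = measure(
    ["link 1", "link 2", "link 3"], clock=lambda: next(ticks)
)
print(latency, response_time, len(items))
```

With the default clock, the same function can wrap any generator that yields results as they are received. For a result consisting of a single piece the two measurements coincide up to the clock resolution, as stated in the latency definition.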
Security
is the degree to which information is passed privately from users to the information source and back. Security covers technical aspects such as cryptography, secure login, etc., but also the possibility of anonymization and authentication of the information source by a trusted organization. Most WWW information sources publish a privacy policy to show that they are concerned with the topic.
The importance of security is very application domain dependent: Users of search engines typically are not concerned about privacy; quite the contrary: The meta search engine MetaCrawler provides a utility that allows users to watch queries as they are passed to the engine. In other application domains users are very sensitive towards security: Users typically prefer their stock quote lookups to be secure. Complex queries against molecular biology information systems can already spell out a valuable idea.
Synonyms: privacy, access security 
Timeliness
is the average age of the information in a source. The unit of timeliness depends on the application: for some, seconds are appropriate; for others, days are sufficiently precise. Here, the age of data is not the time between creation of the data and now but the time since the last update or verification of the data. For instance, the timeliness of search engines is their update frequency, i.e., the frequency with which they re-index Web pages. It is not the age of the Web page itself. For stock information systems, timeliness is a measure for the delay with which stock quotes are presented. Typical free services have a 15 minute delay between the occurrence of a quote and its delivery to the user, while subscription quote services have much less or even no delay. In a fast growing area such as molecular biology it is reasonable to use the update frequency of the data source rather than the average age of the data as the criterion.
Timeliness is arguably one of the most important criteria for WWW information sources. The main advantage of the Internet over traditional information sources like newspapers or journals is its ability to provide new information almost instantly and world-wide. A main reason for users to turn to WWW information services is to obtain up-to-date information. For search engines, high timeliness means, for instance, fewer dead links; for stock information systems, high timeliness allows quicker reactions to changes on the stock market.
Synonyms: up-to-date, freshness, currentness 
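Read as the average age since the last update or verification, timeliness is a plain mean over timestamps. The sketch below assumes that such a timestamp is known for every item; the function name and the sample data are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def timeliness(last_updated, now):
    """Average time since each item was last updated or verified.

    last_updated -- list of datetimes, one per item (assumed to be available)
    now          -- the moment at which the source is scored
    """
    if not last_updated:
        raise ValueError("need at least one timestamp")
    total = sum((now - stamp for stamp in last_updated), timedelta())
    return total / len(last_updated)


now = datetime(2000, 5, 2, 12, 0, tzinfo=timezone.utc)
stamps = [now - timedelta(minutes=15), now - timedelta(minutes=45)]
average_age = timeliness(stamps, now)
print(average_age)
```

The result is a timedelta, so the unit (seconds, minutes, days) can be chosen per application when the score is reported. A lower average age means higher timeliness.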


Intellectual criteria

Believability
is the degree to which the information is accepted as correct by the user. In a sense, believability is the expected accuracy. Therefore it can be determined by the same unit as accuracy, but generally, the believability criterion will be influenced by many other factors so that a generic "grade" will be more appropriate. 
When querying autonomous information sources, believability is an important criterion. Apart from simply providing information, a source must convince the user that this information is "accepted or regarded as true, real, and credible".
Synonyms: error rate, credibility, trustworthiness 
Objectivity
is the degree to which information is unbiased and impartial. The criterion score mainly depends on the affiliation of the information provider. Also, the criterion is strongly related to the verifiability criterion: The more verifiable a source is, the more objective it will be. Again, objectivity is measured by some grade as there is no "real" unit for this criterion.
Objectivity is an important criterion if users fear some malice of the information source. This fear could be addressed by simply not using an information source with low objectivity or at least by verifying the information. Search engines often display biased information for two reasons: (i) Web pages indexed by the search engine add certain keywords to their page to be ranked higher for searches. A popular example is to repeat the word "sex" thousands of times on a web page. (ii) Search engines can be paid by Web site providers to purposefully rank their pages higher than others, overriding the standard ranking algorithm employed by the search engine. Such bias is difficult to detect since search engines do not publish their ranking algorithms. Stock quotes, on the other hand, can easily be verified, so bias is not very likely and thus objectivity is not an important criterion for that domain.
Reputation
is the degree to which the information or its source is in high standing. For instance, the Yahoo stock quote service might have a higher reputation than that of some offshore bank; the CNN news server might have a higher reputation than that of the Phoenix Gazette. Reputation increases with a higher level of awareness among the users. Thus, older, long-established information sources will typically have a higher reputation.
The reputation criterion can be important in some applications. For instance, we observed that most biologists actually prefer certain sources over others because of their higher reputation. Also, people tend to trust data from their own institute more than external data, as they also tend to prefer well-known sources. This fact can be expressed with the help of reputation scores.
Synonyms: credibility


Instantiation-related Criteria

Amount of data
is the size of the query result, measured in bytes. Whenever appropriate, amount can also be measured as the number of result tuples. For instance, the number of links a search engine can return for a single request typically varies from 10 to 100. Note that this is independent of the actual number of hits a search engine discovers. We know of no search engine that will actually return more than 100 links, even if more were found. When querying a stock information service for company profiles, amount is the length of the profile in bytes.
We argue that the larger the amount of data, the better we consider the source or response. There are methods to reduce the amount of data in a sophisticated way. For instance, techniques from the information retrieval area can be applied to find the best links from a set returned by a search engine. The more input such techniques have, the better their results will be: The probability of finding a relevant link (by hand or automatically) is larger if more links are returned.
The importance of the amount criterion depends on the type of query. In a query for the stock quote of a certain company the amount of data returned is of no importance: it is simply a number. However, in a query for all information on a company, including profiles, press releases, etc., amount can be quite important.
Synonyms: essentialness 
Representational conciseness 
is the degree to which the structure of the information matches the information itself. Search engines typically have high conciseness: their main results are link lists, which are represented as such. Molecular biology information systems, on the other hand, often have low conciseness, with incomprehensible data formats, many abbreviations, and unclear graphical representations. Also, with most results the systems deliver a large amount of historical data that is no longer valid.
In our context of a mediator-wrapper architecture, representational conciseness is only of marginal importance. Wrappers extract the information from the sources and restructure it according to the global schema of the mediator. Any representational inconciseness would be caught by the wrapper and hidden from the user. Note, however, that representational conciseness is a measure for the complexity and stability of a wrapper. The less concise the representation is, the more difficult it is to build a wrapper around the source, to the degree that parts or all of the information cannot be extracted. Low conciseness regarding previous information makes a wrapper highly unstable, i.e., the wrapper must be maintained and updated frequently.
Synonyms: attribute granularity, occurrence identifiability, structural consistency, appropriateness, format precision 
Representational consistency
is the degree to which the structure of the information conforms to previously returned information. Since we review multiple sources, we extend this definition to not only compare compatibility with previous data but also with data of other sources. Thus representational consistency is also the degree to which the structure of the information conforms to that of other sources. 
We assume wrappers to deliver a relational export schema which is always consistent with the global schema against which we query. Representational consistency is thus a criterion to measure the work of the wrapper necessary to parse files, transform units and scales or translate identifiers into canonical object names. 
Synonyms: integrity, homogeneity, semantic consistency, value consistency, portability, compatibility 
Understandability
is the degree to which the information can be easily comprehended by the user. Thus, understandability measures how well a source presents its information, so that the user is able to comprehend its semantic value. Understandability is measured as a grade. The grade must be specified only once by the user and remains the same as long as the source does not undergo major changes in its appearance. The grade could possibly be determined with the help of a questionnaire containing questions on structure, language, layout etc. 
Understandability is only marginally important for mediated information systems, for the same reason as for representational conciseness. A wrapper extracts information from the information source and transforms it according to the relational schema of the mediator. Any good or bad understandability will be lost in this process. However, there are application domains or types of information where the understandability score is retained. For instance, the understandability of a news article remains the same, independent of any representational changes. Also, graphics typically are not changed by the wrapper or mediator, so the understandability remains unchanged as well.
Synonyms: ease of understanding 
Verifiability
is the degree and ease with which the information can be checked for correctness. When information is mistrusted, it should be verified with the help of a third party, unbiased if possible. Verifiability is high if either the information source names the actual source of the information or if it points to a trusted third party source where the information can be checked for correctness. Note that verifiability differs from believability in that verification can find information to be correct or incorrect, while belief trusts the information without checking.
Verifiability is an important factor if the mediated system includes sources with low believability or reputation. WWW information sources especially can suffer low scores in these criteria because they have not had the time to establish a good reputation.
Synonyms: naturalness, traceability, provability
Felix Naumann

(2000/5/2)