Information Quality
Defining information quality (IQ) is extremely difficult due to the subjective
nature of quality. Definitions such as "fitness for use" are general enough
to be correct, but they are unusable when IQ is to be evaluated.
Information quality (IQ) has many facets. To capture these many dimensions,
quality is often described as some set of criteria, for instance by Wang
and Strong (1996). What follows is a long list of IQ criteria, as they
have surfaced during research. Not all criteria must be used in all systems.
Also, new criteria can easily be added to this list.
For links to other sites and projects concerning information quality, click
here.
Content-related Criteria
-
Accuracy
-
is the quotient of the number of correct values in the source
and the overall number of values in the source. For our context this is
the percentage of data without data errors such as non-unique keys
or out of range values.
-
Increasing accuracy is a main
goal of many research efforts. Accuracy is
often used synonymously with data quality, as opposed to information quality.
For us, data quality or accuracy is only
one aspect of the overall information quality, which includes the entire
set of criteria in this list.
-
Considering the accuracy criterion
in a WWW information system setting has the same importance as for traditional
databases. Accuracy is one of the main intrinsic
properties of information. Incorrect information is hard to detect, useless,
and in many cases even harmful.
-
Synonyms: data quality, error rate, correctness, reliability,
integrity, precision
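As a minimal sketch of the quotient described above, the following counts a value as incorrect only if it shows one of the two error types named in the definition: a non-unique key or an out-of-range value. The function name `accuracy`, the row-as-dictionary layout, and the `valid_ranges` parameter are assumptions made for illustration; they are not part of the original definition.

```python
from collections import Counter

def accuracy(rows, key, valid_ranges):
    """Quotient of correct values and all values in a source.

    Assumption: a value is incorrect only if it is a duplicated key
    or lies outside the range given for its attribute.
    """
    key_counts = Counter(row[key] for row in rows)
    total = 0
    correct = 0
    for row in rows:
        for attr, value in row.items():
            total += 1
            if attr == key:
                ok = key_counts[value] == 1       # key must be unique
            elif attr in valid_ranges:
                low, high = valid_ranges[attr]
                ok = low <= value <= high         # value must be in range
            else:
                ok = True                         # no check known for this attribute
            correct += ok
    if total == 0:
        raise ValueError("accuracy is undefined for an empty source")
    return correct / total

# Hypothetical source with one out-of-range price and one duplicated key.
rows = [
    {"id": 1, "price": 10.0},
    {"id": 2, "price": -5.0},
    {"id": 2, "price": 20.0},
]
score = accuracy(rows, key="id", valid_ranges={"price": (0.0, 1000.0)})
```

Here three of the six values pass the checks, so `score` is 0.5. Errors that these two checks cannot see, such as a plausible but wrong price, are not detected by this sketch.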
-
Completeness
-
is the quotient of the number of non-null values
in a source and the number of non-null values in the combination
of all available sources. Applied to the relational schema of Chapter the
number of non-null values in the combination of all available
sources corresponds to the size of the universal relation: The number of
attributes multiplied by the number of tuples if all available sources
were queried. The number of non-null values in a source is then
the number of values a source can insert into the schema of the universal
relation.
-
We define completeness more
formally in Chapter . There we analyze this criterion in great depth and
apply it to several application domains. In Chapter we perform optimization
to maximize completeness.
-
Completeness is of great importance
in information systems that integrate multiple information sources. One
of the main goals for integration is to increase completeness:
Querying only one source typically gives only one part of the result. Querying
another source will provide another, possibly overlapping part. The more
sources we query, the more complete the result will be.
-
Synonyms: coverage, scope, granularity, comprehensiveness,
density, extent
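The completeness quotient can be sketched as follows, assuming the size of the universal relation is already known as the number of attributes times the number of tuples. The names `non_null_count` and `completeness`, and the use of `None` for null values, are illustrative assumptions rather than the formal definition referred to above.

```python
def non_null_count(rows, attributes):
    """Number of non-null values a source can insert into the universal schema."""
    return sum(1 for row in rows for a in attributes if row.get(a) is not None)

def completeness(source_rows, attributes, universal_tuples):
    """Non-null values of one source divided by the size of the universal relation.

    Assumption: the universal relation has len(attributes) * universal_tuples cells.
    """
    universe_size = len(attributes) * universal_tuples
    if universe_size == 0:
        raise ValueError("completeness is undefined for an empty universal relation")
    return non_null_count(source_rows, attributes) / universe_size

# Hypothetical stock source: 3 non-null values out of 3 attributes * 4 tuples.
attributes = ["symbol", "price", "volume"]
source = [
    {"symbol": "IBM", "price": 100.0, "volume": None},
    {"symbol": "SAP", "price": None},
]
score = completeness(source, attributes, universal_tuples=4)
```

In this example the source contributes 3 of 12 possible values, giving a completeness of 0.25.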
-
Customer support
-
is the amount and usefulness of human help via email or phone.
This criterion is closely related to the documentation
criterion below. It is one part of an overall help system to guide users
in understanding and using information. Depending on the type of support,
one part of the measure could be the average waiting time for a response.
Another, more difficult part to be assessed, is how useful the help is.
-
For a discussion on the importance of this criterion, see
the
documentation criterion below, where
the same arguments apply.
-
Documentation
-
is the amount and usefulness of documents with meta information.
For WWW information systems documentation usually is in the form of "help"-links
that lead to Webpages that explain the provided data. As a simple measure
we count the number of words in the documentation. Issues of usefulness
and understandability are already covered by other criteria. We extend
the scope of those scores to the documentation part of the source.
-
The importance of the documentation
criterion depends on the application: Often the presentation of information
is self-describing and it is not necessary to measure how well a source
documents its information. For instance, this is the case for search engines.
On the other hand, there are domains where integration and use of the source
is not possible without good documentation. Molecular biology information
sources have great problems with synonyms and homonyms and other types
of heterogeneity. Without good documentation, query results are very prone
to misunderstanding.
-
Synonyms: clarity of definition, traceability
-
Interpretability
-
is the degree to which the information conforms to the technical
abilities of the consumer. Technical abilities include languages spoken,
units understood, etc. A highly interpretable source must also provide
clear and simple definitions of all elements of the information. In this
sense
interpretability is similar to documentation
and
understandability.
-
In integrated information systems, interpretability
of a source is not as important as other criteria, because we assume that
many of the issues are hidden by wrappers and the mediator. The wrappers
of a source can already convert units to suit the user, text can be automatically
translated at least to a useful extent, etc. It is then up to the wrapper
to present the integrated information in an interpretable way. In conclusion,
an information source with high interpretability
is easier to include in a mediated system, but the criterion plays a less
important role once the source is successfully integrated.
-
Synonyms: clarity of definition, simplicity
-
Relevancy
(or
relevance)
-
is the degree to which the provided information satisfies
the user's need. Relevancy is an often used
criterion in the field of information retrieval. There, a document or piece
of information is considered to be relevant to the query, if the keywords
of the query appear often and/or in prominent positions in the document.
The importance of the relevancy
criterion depends on the application domain. For instance, for search engines
relevancy is quite important: returned Webpage
links should be as relevant as possible, even though this is difficult
to achieve. For instance a query for the term "jaguar" at any WWW search
engine will retrieve document links both for the animal and the automobile.
If the user had the animal in mind, the links to automobile sites should
have been considered as not relevant. In other application domains, relevancy
is implicitly high. For instance a query for IBM stock quotes in an integrated
stock information system will only return relevant results, namely IBM
stock quotes. The reason for this discrepancy is the definition of the
domain: Search engines have the entire WWW as a domain and thus provide
much information that is of no interest to the user. The domain of a stock
information system is much more clear cut and much smaller, so a query
is less likely to produce irrelevant results.
For our purposes we reduce the relevancy
criterion to a correctness criterion. If a result is correct with respect
to the user query, we assume it is also relevant. If it is actually not
relevant, the user query was either incorrect with respect to what the
user had in mind or it was not specific enough. Relevance feedback techniques
were developed by Salton and McGill to make a query more specific and increase
relevancy.
Synonyms: domain precision, minimum redundancy, applicability,
helpfulness
-
Value-Added
-
is a criterion that measures the amount of monetary benefit
the use of the information provides. This criterion is typical for decision
support information systems, where a cost-benefit calculation is
undertaken. The value-added criterion must
be considered when there is a cost involved in obtaining the information and
when the nature of the information is not yet known.
-
Often value-added cannot be
attributed to the source of the information but only to the information
itself. A stock information system will provide stock quotes but cannot
influence them and thus cannot increase "value-addedness"; a search engine
has no influence on how useful its results are. For this reason this criterion
is often not considered for WWW information systems.
Technical Criteria
-
Availability
-
of an information source is the probability that a feasible
query is correctly answered in a given time range. Availability
is a technical measure concerning hardware and software of the source and
the network connections between user, mediator, wrappers, and sources.
Typically, availability is also time-dependent
due to different usage patterns of the information source.
-
Availability is an important
criterion for WWW information sources for many reasons: network congestion
that varies with the time of day and week; world-wide distribution of servers;
high concurrent usage; denial-of-service attacks; planned maintenance interruptions.
Query execution in integrated systems is especially vulnerable to low availability
because usually
all participating sources of a query execution plan
must be available in order for the query to be executed correctly. In Section
we pay special attention to the availability
criterion and propose an algorithm that dynamically adapts its optimization
strategy in case of an unavailable source.
-
Synonyms: accessibility, technical reliability, retrievability,
performability
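One way to estimate availability empirically is to probe a source repeatedly and record whether each feasible query was answered correctly. The sketch below groups such probes by hour of day to reflect the time dependence mentioned above. The probe format `(hour, answered)` and the function name `availability_by_hour` are assumptions for illustration; the text does not prescribe a measurement procedure.

```python
from collections import defaultdict

def availability_by_hour(probes):
    """Estimate availability per hour of day as the fraction of successful probes.

    probes: iterable of (hour_of_day, answered_correctly) pairs (assumed format).
    """
    totals = defaultdict(int)
    successes = defaultdict(int)
    for hour, answered in probes:
        totals[hour] += 1
        successes[hour] += bool(answered)
    return {hour: successes[hour] / totals[hour] for hour in totals}

# Hypothetical probe log: the source fails once during a busy morning hour.
probes = [(9, True), (9, False), (9, True), (9, True), (22, True), (22, True)]
by_hour = availability_by_hour(probes)
```

With these probes the estimate is 0.75 for hour 9 and 1.0 for hour 22. A small number of probes gives only a rough estimate of the underlying probability.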
-
Latency
-
is the amount of time in seconds from issuing the query until
the first information reaches the user. If the result of the query is only
one piece of information, e.g., one stock quote, latency
equals response time (see below).
-
Latency is an important criterion
in WWW information system settings for two reasons. First, information is sent
over the Internet using the Hypertext Transfer Protocol (HTTP). This protocol
sends data packaged in chunks of up to 64 kilobytes. If the entire response
has a larger size, the first package can be displayed before further packages
arrive. Additionally, many sources withhold the entire result and only
return the first part. For instance, search engines typically return
only the first 10 links. If the user desires more results, another query
must be posed by following a link. The second reason is that in many applications
the user is actually only interested in the first part of the information
or only in an overview of the result. Again, search engines are a good
example. Often, the first 10 results are enough to satisfy the user, especially
if the results are ranked well. For many other applications, not the actual
result, but the number of results is the only interest of the user. Consider
a user querying a stock information system for companies whose stock has
risen more than 50% during the last year. Most often, not the actual companies
but their number is of interest.
-
Synonyms: Often response time
and latency are used synonymously.
-
Price
-
is the amount of money a user has to pay for a query as determined
by the provider. Commercial data sources usually either charge on a subscription
basis for their information or on a pay-per-query or pay-per-byte basis.
Often there is a direct tradeoff between price
and other IQ criteria. Free stock information services provide stock quotes
with some delay (usually 15 minutes) while subscription systems provide
the quotes in real time. Also there may be a hidden cost in retrieving information:
Users spend time online paying for the internet connection and users are
exposed to advertisements.
-
Considering price is important
if at least one integrated information source charges money for information.
It is a common opinion that the World Wide Web has prospered due to its free
information services. Information sources earn money by displaying advertisements.
Experts predict a change towards high quality information sources that
charge money for their services.
-
Synonyms: query value-to-cost ratio, cost-effectivity
-
Response
time
-
measures the delay in seconds between submission of a query
by the user and reception of the complete response from the information
system. The score for this criterion depends on unpredictable factors such
as network traffic, server workload etc. Another factor is the type and
complexity of the user query. Again, this cannot be predicted; however,
it can be taken into account once the query is posed and a query execution
plan is developed. Finally, the technical equipment of the information
server plays a role as well. However, in WWW settings network delay usually
dominates all other factors.
-
Response time is the main criterion
for traditional database optimizers. While for WWW information systems
it is just one aspect among many other IQ criteria, it is still of some
significance. Because of frequent time-outs and unknown availability of
sources, users waiting long for a response from a WWW information source
are more prone to abort the query than database users. This cancellation
can be prevented by low
latency, which gives
users at least some results early on. Another reason for the importance
of low response time is the potential competition on the Web. With many
alternative sites the users will quickly switch from one source to another
to find the desired information. An integrated system such as a meta search
engine avoids this effect but must also consider response
time when deciding which sources to use to answer a query.
-
Synonyms: performance, turnaround time
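The difference between latency and response time can be made concrete with a small measurement sketch. It assumes the result arrives as an iterable of byte chunks; the function name `measure` and the injectable `clock` parameter are illustrative assumptions, not part of the criteria definitions.

```python
import time

def measure(chunks, clock=time.perf_counter):
    """Return (latency, response_time, bytes_received) in seconds and bytes.

    latency:       time until the first chunk arrives
    response_time: time until the complete response has arrived
    """
    start = clock()
    latency = None
    received = 0
    for chunk in chunks:
        if latency is None:
            latency = clock() - start   # first information reached the user
        received += len(chunk)
    response_time = clock() - start     # complete response received
    if latency is None:
        latency = response_time         # empty result: nothing arrived earlier
    return latency, response_time, received

def slow_source():
    """Hypothetical source that delivers its result in three delayed parts."""
    for part in (b"first 10 links", b"next 10 links", b"last links"):
        time.sleep(0.01)
        yield part

latency, response_time, size = measure(slow_source())
```

For a result that consists of a single chunk, the two measurements are taken almost at the same moment, which matches the remark that latency equals response time for a single piece of information.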
-
Security
-
is the degree to which information is passed privately from
users to the information source and back.
Security
covers technical aspects such as cryptography, secure login etc., but also
the possibility of anonymization and authentication of the information
source by a trusted organization. Most WWW information sources publish
a privacy policy to show that they are concerned with the topic.
-
The importance of security
is very application domain dependent: Users of search engines typically
are not concerned about privacy. Quite the contrary: the meta search engine
MetaCrawler provides a utility that allows users to watch queries as they
are passed to the engine. In other application domains users are very sensitive
towards security: Users typically prefer their stock quote lookups to be
secure. Complex queries against molecular biology information systems can
already spell out a valuable idea.
-
Synonyms: privacy, access security
-
Timeliness
-
is the average age of the information in a source. The unit
of timeliness depends on the application:
for some seconds are appropriate, for others days are sufficiently precise.
Here, the age of data is not the time between creation of the data and
now but the time since the last update or verification of the data. For
instance the timeliness of search engines is their update-frequency, i.e.,
the frequency with which they re-index Web pages. It is not the age of
the Web page itself. For stock information systems, timeliness is a measure
for the delay with which stock quotes are presented. Typical free services
have a 15 minute delay between the occurrence of a quote and its delivery
to the user, while subscription quote services have much less or even no
delay. In a fast growing area such as molecular biology it is reasonable
to use the update-frequency of the data source rather than the average age
of the data as criterion.
-
Timeliness is arguably one
of the most important criteria for WWW information sources. The main advantage
of the Internet over traditional information sources like newspapers or
journals is its ability to provide new information almost instantly and
world-wide. A main reason for users to turn to WWW information services
is to obtain up-to-date information. For search engines, high timeliness
means, for instance, fewer dead links; for stock information systems, high
timeliness
allows quicker reactions to changes on the stock market.
-
Synonyms: up-to-date, freshness, currentness
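A minimal sketch of timeliness as average age, where the age of an item is the time since its last update or verification, could look as follows. The function name `timeliness` and the use of a list of timestamps are assumptions for illustration.

```python
from datetime import datetime, timedelta

def timeliness(last_updated, now):
    """Average age of the items in a source.

    last_updated: timestamps of the last update or verification of each item.
    Age is measured from the last update, not from the creation of the data.
    """
    if not last_updated:
        raise ValueError("timeliness is undefined for an empty source")
    total_age = sum((now - t for t in last_updated), timedelta(0))
    return total_age / len(last_updated)

# Hypothetical quotes last updated 15 and 45 minutes before the query time.
now = datetime(2000, 1, 1, 12, 0)
updates = [datetime(2000, 1, 1, 11, 45), datetime(2000, 1, 1, 11, 15)]
average_age = timeliness(updates, now)
```

Here the average age is 30 minutes. Returning a `timedelta` leaves the choice of unit (seconds, days) to the application, as the definition above suggests.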
Intellectual criteria
-
Believability
-
is the degree to which the information is accepted as correct
by the user. In a sense, believability is
the expected accuracy. Therefore it can be
determined by the same unit as accuracy,
but generally, the believability criterion
will be influenced by many other factors so that a generic "grade" will
be more appropriate.
-
When querying autonomous information sources
believability
is an important criterion. Apart from simply providing information, a source
must convince the user that this information is "accepted or regarded
as true, real, and credible".
-
Synonyms: error rate, credibility, trustworthiness
-
Objectivity
-
is the degree to which information is unbiased and impartial.
The criterion score mainly depends on the affiliation of the information
provider. Also, the criterion is strongly related to the verifiability
criterion: The more verifiable a source is, the more objective it will
be. Again,
objectivity is measured by some
grade as there is no "real" unit for this criterion.
-
Objectivity is an important criterion if users fear some
malice of the information source. This fear could be approached by simply
not using an information source with low
objectivity
or at least by verifying the information. Search engines often display
biased information for two reasons: (i) Web pages indexed by the search
engine add certain keywords to their page to be ranked higher for searches.
A popular example is to repeat the word "sex" thousands of times on a web
page. (ii) Search engines can be paid by Web site providers to purposefully
rank their pages higher than others overriding the standard ranking algorithm
employed by the search engine. Such bias is difficult to detect since search
engines do not publish their ranking algorithms. Stock quotes on the other
hand can easily be verified, so bias is not very likely and thus, objectivity
is not an important criterion for that domain.
-
Reputation
-
is the degree to which the information or its source is in
high standing. For instance, the Yahoo stock quote service might have a
higher reputation than that of some off shore bank; the CNN news server
might have a higher reputation than that of the Phoenix Gazette. Reputation
increases with a higher level of awareness among the users. Thus, older,
long-established information sources will typically have a higher reputation.
-
The reputation criterion can
be important in some applications. For instance, we observed that most
biologists actually prefer certain sources over others because of their
higher reputation. Also, people tend to trust data from their own institute
more than external data, as they also tend to prefer well-known sources.
This fact can be expressed with the help of
reputation
scores.
-
Synonyms: credibility
top of page
Instantiation-related Criteria
-
Amount of data
-
is the size of the query result, measured in bytes. Whenever
appropriate, amount can also be measured
as the number of result tuples. For instance, the number of links a search
engine can return for a single request typically varies from 10 to 100.
Note that this is independent of the actual number of hits a search engine
discovers. We know of no search engine that will actually return more
than 100 links, even if more were found. When querying a stock information
service for company profiles, amount is
the length of the profile in bytes.
-
We argue that the larger the amount
of data, the better we consider the source or response. There are methods
to reduce the amount of data in a sophisticated way. For instance, techniques
from the information retrieval area can be applied to find the best links
from a set returned by a search engine. The more input such techniques
have, the better their results will be: The probability to find a relevant
link (by hand or automatically) is larger if more links are returned.
-
The importance of the amount
criterion depends on the type of query. In a query for the stock quote
of a certain company the amount of data returned
is of no importance, since it is simply a number. However, in a query for all
information on a company including profiles, press releases, etc., amount
can be quite important.
-
Synonyms: essentialness
-
Representational
conciseness
-
is the degree to which the structure of the information matches
the information itself. Search engines typically have a high conciseness: their
main results are link lists which are represented as such. Molecular biology
information systems on the other hand often have a low conciseness
with incomprehensible data formats, many abbreviations, and unclear graphical
representations. Also, with most results the systems deliver a large amount
of historical data which is no longer valid.
-
In our context of a mediator-wrapper architecture
representational
conciseness is only of marginal importance. Wrappers extract the
information from the sources and restructure them according to the global
schema of the mediator. Any representational inconciseness would be caught
by the wrapper and hidden from the user. Note however, representational
conciseness is a measure for the complexity and stability of a wrapper.
The less concise the representation is, the more difficult it is to build
a wrapper around the source to the degree that parts or all of the information
cannot be extracted. Low conciseness regarding previous information makes
a wrapper highly unstable, i.e., the wrapper must be maintained and updated
frequently.
-
Synonyms: attribute granularity, occurrence identifiability,
structural consistency, appropriateness, format precision
-
Representational
consistency
-
is the degree to which the structure of the information conforms
to previously returned information. Since we review multiple sources, we
extend this definition to not only compare compatibility with previous
data but also with data of other sources. Thus representational
consistency is also the degree to which the structure of the information
conforms to that of other sources.
-
We assume wrappers to deliver a relational export schema
which is always consistent with the global schema against which we query.
Representational
consistency is thus a criterion to measure the work of the wrapper
necessary to parse files, transform units and scales or translate identifiers
into canonical object names.
-
Synonyms: integrity, homogeneity, semantic consistency, value
consistency, portability, compatibility
-
Understandability
-
is the degree to which the information can be easily comprehended
by the user. Thus, understandability measures how well a source presents
its information, so that the user is able to comprehend its semantic value.
Understandability
is measured as a grade. The grade must be specified only once by the user
and remains the same as long as the source does not undergo major changes
in its appearance. The grade could possibly be determined with the help
of a questionnaire containing questions on structure, language, layout
etc.
-
Understandability is only marginally
important for the mediated information systems for the same reason as for
representational
conciseness. A wrapper extracts information from the information
source and transforms it according to the relational schema of the mediator.
Any good or bad understandability will be lost in this process. However,
there are application domains or types of information, where the understandability
score is retained. For instance, the
understandability
of a news article remains the same, independent of any representational
changes. Also, graphics typically are not changed by the wrapper or mediator,
so the
understandability remains unchanged
as well.
-
Synonyms: ease of understanding
-
Verifiability
-
is the degree and ease with which the information can be
checked for correctness. When information is mistrusted, it should be verified
with the help of a, if possible unbiased, third party. Verifiability
is high if either the information source names the actual source of the
information or if it points to a trusted third party source where the information
can be checked for correctness. Note that verifiability
differs from believability in that verification
can find information to be correct or incorrect, while belief trusts the information
without checking.
-
Verifiability is an important
factor if the mediated system includes sources with a low believability
or
reputation. Especially WWW information
sources can suffer low scores in these criteria because they have not
had the time to establish a good reputation.
-
Synonyms: naturalness, traceability, provability