International Research Journal of Advanced Engineering and Science ISSN (Online): 2455-9024

Information Retrieval Models, Techniques and Applications

Olalere A. Abass (1), Oluremi A. Arowolo (2)
1, 2 Dept. of Computer Science, Tai Solarin College of Education, Omu-Ijebu, Ogun State, Nigeria

Abstract— An Information Retrieval (IR) system focuses on the processing of data collections by means of representation, storage, and searching for the purpose of knowledge discovery in response to a user request expressed as a query. The tendency of the IRS to produce relevant documents with high precision and recall, meeting the user's need based on the query input, depends on the adoption of appropriate techniques by the search engine. In this paper, we explain the concepts of IR and the traditional models on which various IR techniques rely. We also give a detailed description of IR techniques that have been successfully applied to store, manage and retrieve documents from the huge amounts of data available to users of IR systems. We show that applying these retrieval techniques in digital libraries, information filtering systems, media search, search engines and domain-specific areas of IR can increase throughput and minimize the access time of the user with respect to information needs.

Keywords— IRS framework, models, IR techniques, recall, precision.

I. INTRODUCTION
Information retrieval (IR), as a subfield of computer science, deals with the representation, storage, and access of information and is concerned with the organization and retrieval of information from large database collections (Sagayam et al, 2012). In response to a user request expressed as a query, IR focuses on processing a collection of data by means of representation, storage, and searching for the purpose of knowledge discovery. This process involves various stages, beginning with representing the data and ending with returning relevant information to the user; intermediate stages include filtering, searching, matching and ranking operations. The primary objective of an information retrieval system (IRS) is to help users access relevant information corresponding to their needs, or a document that satisfies the user's information needs. According to [1], there are two basic measures for assessing the quality of an IRS: (i) Precision, the percentage of retrieved documents that are in fact relevant to the query, and (ii) Recall, the percentage of documents that are relevant to the query and were in fact retrieved. The tendency of the IRS to yield a list of relevant documents with high precision and recall, meeting the user's need as specified by the query, depends on the use of appropriate techniques by the search engine. This remains the focus of the paper, as we attempt to explain the different IR techniques used so far by various researchers and developers of IRS.
The structure of this paper is as follows. A brief review of the IRS framework and IR models is presented in Section 2, followed by IR techniques in Section 3. Section 4 deals with the different areas of application of IR techniques. Finally, the conclusion of the paper is drawn in Section 5.
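The two quality measures defined above can be sketched in a few lines of Python. This is only an illustration; the document identifiers and relevance judgments below are hypothetical, not taken from the paper.

```python
# Minimal sketch of the two IRS quality measures: precision and recall.
# The document identifiers and relevance judgments are hypothetical.

def precision(retrieved: set, relevant: set) -> float:
    """Fraction of retrieved documents that are relevant to the query."""
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0

def recall(retrieved: set, relevant: set) -> float:
    """Fraction of relevant documents that were actually retrieved."""
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0

retrieved = {"d1", "d2", "d3", "d4"}   # documents returned for the query
relevant = {"d2", "d4", "d7"}          # documents judged relevant to the query

print(precision(retrieved, relevant))  # 0.5      (2 of the 4 retrieved are relevant)
print(recall(retrieved, relevant))     # 0.666... (2 of the 3 relevant were retrieved)
```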
II. INFORMATION RETRIEVAL SYSTEM

A. Framework of IRS
According to Sharma and Patel (2013), there are three basic processes an IRS has to support: (i) the representation of the content of the documents, (ii) the representation of the user's information need, and (iii) the comparison of the two representations. The processes are visualized in Figure 1, as proposed by Sharma and Patel (2013). In the figure, squared boxes represent data and rounded boxes represent processes.

Fig. 1. A general framework of IRS.

Representing the documents is usually called the indexing process. This process takes place off-line, that is, the end user of the IRS is not directly involved. The indexing process results in a representation of the document. The process of representing the user's information need is often referred to as the query formulation process, and the resulting representation is the query (Hiemstra, 2009). Comparing the two representations is known as the matching process, and the retrieval of documents is the result of this process.

B. Information Retrieval Models
Mathematical models are used in many scientific areas with the objective to understand and reason about some behaviour or phenomenon in the real world. A model of IR predicts and explains what a user will find relevant to a given query. The correctness of the model's predictions can be tested in a controlled experiment. Hence, a model of IR serves as a blueprint that is used to implement an actual IRS (Hiemstra, 2009).

C. The Traditional or Classical Models
The three most used models in IR research are the vector space model, the probabilistic model, and the inference network model (Singhal, 2001). These three are regarded as the traditional retrieval models.

i. Boolean model (BM) - a measure of exact match
This model provides exact matching, i.e. documents are either retrieved or not, and the retrieved documents are not ranked. The retrieval function in this model treats a document as either relevant or irrelevant (Alhenshiri, 2003). That is, in the BM, retrieved documents are adjudged as either "relevant" or "not relevant".

ii. Vector space model (VSM) - a measure of document similarity to the query, with ranking
The VSM can best be characterized by its attempt to rank documents by the similarity between the query and each document (Salton and McGill, 1986). In the VSM, documents and the query are represented as vectors, and the angle between the two vectors is computed using the cosine similarity function, which can be defined as (Sharma and Patel, 2013):

sim(\vec{d_j}, \vec{q}) = \frac{\sum_{i=1}^{t} w_{i,j} \, w_{i,q}}{\sqrt{\sum_{i=1}^{t} w_{i,j}^{2}} \, \sqrt{\sum_{i=1}^{t} w_{i,q}^{2}}}    (1)

Documents and queries are represented as vectors:

\vec{d_j} = (w_{1,j}, w_{2,j}, \ldots, w_{t,j})    (2)

\vec{q} = (w_{1,q}, w_{2,q}, \ldots, w_{t,q})    (3)

The VSM of IR is a very successful statistical method proposed by Salton and Buckley (1988). A major achievement of the researchers who developed the VSM is the introduction of the family of tf-idf term weights (Samar et al, 2016). These weights have a term frequency (tf) factor measuring the frequency of occurrence of the terms in the document or query texts and an inverse document frequency (idf) factor measuring the inverse of the number of documents that contain a query or document term.

iii. The probabilistic model - a measure of probability of relevance
This family of IR models is based on the general principle that documents in a collection should be ranked by decreasing probability of their relevance to a query. This is often called the probabilistic ranking principle, PRP (Robertson, 1990). The most important characteristic of the probabilistic model is its attempt to rank documents by their probability of relevance given a query (Robertson and Jones, 1976). Documents and queries are represented by binary vectors \vec{d} and \vec{q}, each vector element indicating whether a document attribute or term occurs in the document or query, or not. Instead of probabilities, the probabilistic model uses odds O(R), where O(R) = P(R) / (1 - P(R)), R means "document is relevant" and \bar{R} means "document is not relevant" (Hiemstra et al, 2000).
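To make the VSM concrete, the cosine ranking of Eqs. (1)-(3) can be sketched as follows. The three-term vocabulary and the term-weight vectors are illustrative only, not taken from the paper.

```python
import math

# Minimal sketch of VSM ranking by cosine similarity (Eqs. 1-3).
# The vocabulary and the term weights are illustrative only.

def cosine_similarity(d: list[float], q: list[float]) -> float:
    """Cosine of the angle between a document vector and the query vector (Eq. 1)."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    return dot / (norm_d * norm_q) if norm_d and norm_q else 0.0

documents = {                      # document vectors, as in Eq. (2)
    "d1": [0.0, 2.0, 1.0],
    "d2": [1.0, 0.0, 0.0],
}
query = [0.0, 1.0, 1.0]            # query vector, as in Eq. (3)

ranking = sorted(documents,
                 key=lambda doc: cosine_similarity(documents[doc], query),
                 reverse=True)
print(ranking)                     # ['d1', 'd2']: d1 is more similar to the query
```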
III. INFORMATION RETRIEVAL TECHNIQUES

A. Term Weighting
Term weighting is a technique for obtaining the most critical piece of information needed for document ranking in all IR models, and various methods for weighting terms have been developed in the field. Weighting methods developed under the probabilistic models rely heavily upon better estimation of various probabilities (Robertson and Jones, 1976; Singhal, 2001), while methods developed under the VSM are often based on researchers' experience with systems and large-scale experimentation (Salton and Buckley, 1988). In both models, three main factors come into play in the final term weight formulation (Singhal, 2001):

i. Term Frequency (or tf)
Words that repeat multiple times in a document are considered salient. Term weights based on tf have been used in the VSM since the 1960s. TF addresses how relevant a particular document d is to a given term t. One way of measuring TF(d,t), the relevance of a document d to term t, is:

TF(d,t) = \log\left(1 + \frac{n(d,t)}{n(d)}\right)    (4)

where n(d) denotes the number of terms in the document and n(d,t) denotes the number of occurrences of term t in the document d.

ii. Document Frequency
Words that appear in many documents are considered common and are not very indicative of document content. A weighting method based on this, called inverse document frequency (or idf) weighting, was proposed by Sparck-Jones in the early 1970s. For a query containing multiple keywords, the relevance of a document to the query is estimated by combining the relevance measures of the document to each keyword; a simple way to combine the measures is to add them up. However, not all terms used as keywords are equal. To fix this problem, weights are assigned to terms using the inverse document frequency (IDF), defined as:

IDF(t) = \frac{1}{n(t)}    (5)

where n(t) denotes the number of documents (among those indexed by the system) that contain the term t. The relevance of a document d to a set of terms Q is then defined as:

r(d,Q) = \sum_{t \in Q} TF(d,t) \cdot IDF(t)    (6)

The weight of an index term is thus proportional to its frequency in a document (the term frequency or tf factor), and inversely proportional to its frequency among all documents in the system (the inverse document frequency or idf factor). This measure can be further refined if the user is permitted to specify weights w(t) for terms in the query, in which case the user-specified weights are also taken into account by multiplying TF(t) by w(t) in the above formula. This approach of using term frequency and inverse document frequency is called the TF-IDF approach. It is important that the assignment of weights to every index term (called "term weighting") is automatic. The so-called TF-IDF method is mainly used to determine the weight of a term; TF is the frequency of occurrence of a term in a document, and IDF varies inversely with the number of documents to which the term is assigned (Ropero et al, 2012).
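A short sketch of the TF-IDF relevance score of Eqs. (4)-(6) is given below. The toy corpus and the query are hypothetical, and the formulas follow the reconstructed equations above.

```python
import math

# Minimal sketch of TF-IDF relevance scoring as in Eqs. (4)-(6).
# The toy corpus and the query are hypothetical.

corpus = {
    "d1": ["information", "retrieval", "system", "retrieval"],
    "d2": ["database", "system", "design"],
    "d3": ["information", "filtering", "system"],
}

def tf(doc_terms: list[str], t: str) -> float:
    """TF(d, t) = log(1 + n(d, t) / n(d)), Eq. (4)."""
    return math.log(1 + doc_terms.count(t) / len(doc_terms))

def idf(t: str) -> float:
    """IDF(t) = 1 / n(t), where n(t) is the number of documents containing t, Eq. (5)."""
    n_t = sum(1 for terms in corpus.values() if t in terms)
    return 1 / n_t if n_t else 0.0

def relevance(doc_id: str, query_terms: list[str]) -> float:
    """r(d, Q) = sum of TF(d, t) * IDF(t) over the query terms, Eq. (6)."""
    return sum(tf(corpus[doc_id], t) * idf(t) for t in query_terms)

query = ["information", "retrieval"]
scores = {d: relevance(d, query) for d in corpus}
print(sorted(scores, key=scores.get, reverse=True))  # ['d1', 'd3', 'd2']
```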
iii. Document Length
This is the third factor in term weighting. When collections have documents of varying lengths, longer documents tend to score higher since they contain more words due to word repetitions. This effect is usually compensated by normalizing for document length in the term weighting method. Before TREC (the Text Retrieval Conference), both the VSM and the probabilistic models developed term weighting schemes that were shown to be effective on the small test collections available then. The inception of TREC provided IR researchers with very large and varied test collections, allowing rapid development of effective weighting schemes. The state-of-the-art scoring technique that combines the above three factors is the Okapi weighting based document score (Robertson et al, 1999), shown in Eq. (7) below:

\sum_{t \in Q,D} \ln\frac{N - df + 0.5}{df + 0.5} \cdot \frac{(k_1 + 1)\,tf}{k_1\left((1-b) + b\,\frac{dl}{avdl}\right) + tf} \cdot \frac{(k_3 + 1)\,qtf}{k_3 + qtf}    (7)

where
tf is the term's frequency in the document,
qtf is the term's frequency in the query,
N is the total number of documents in the collection,
df is the number of documents that contain the term,
dl is the document length (in bytes),
avdl is the average document length, and
k_1 (between 1.0 and 2.0), b (usually 0.75) and k_3 (between 0 and 1000) are constants.

According to Singhal (2001), the pivoted normalization weighting based document score is

\sum_{t \in Q,D} \frac{1 + \ln\left(1 + \ln(tf)\right)}{(1-s) + s\,\frac{dl}{avdl}} \cdot qtf \cdot \ln\frac{N+1}{df}    (8)

where s is a constant (usually 0.20).
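The Okapi score of Eq. (7) can be sketched as follows. The per-term statistics and collection statistics in the example are hypothetical; the constants are chosen within the ranges stated above.

```python
import math

# Minimal sketch of the Okapi weighting based document score, Eq. (7).
# The per-term and collection statistics below are hypothetical.

def okapi_term_weight(tf: float, qtf: float, df: float, dl: float,
                      N: int, avdl: float,
                      k1: float = 1.2, b: float = 0.75, k3: float = 1000.0) -> float:
    """Contribution of a single query term to the document score in Eq. (7)."""
    idf_part = math.log((N - df + 0.5) / (df + 0.5))
    tf_part = ((k1 + 1) * tf) / (k1 * ((1 - b) + b * dl / avdl) + tf)
    qtf_part = ((k3 + 1) * qtf) / (k3 + qtf)
    return idf_part * tf_part * qtf_part

def okapi_document_score(term_stats: list[tuple[float, float, float]],
                         dl: float, N: int, avdl: float) -> float:
    """Sum the contributions over the terms shared by the query and the document."""
    return sum(okapi_term_weight(tf, qtf, df, dl, N, avdl)
               for tf, qtf, df in term_stats)

# (tf in document, tf in query, document frequency) for two matching terms.
term_stats = [(3, 1, 120), (1, 1, 15)]
print(okapi_document_score(term_stats, dl=250, N=10_000, avdl=300))
```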
B. Query Modification using Synonyms
In the early years of IR, researchers realized that it was quite hard for users to formulate effective search requests. It was thought that adding synonyms of query words to the query should improve search effectiveness. Early research in IR relied on a thesaurus to find synonyms (Singhal, 2001). However, it is quite expensive to obtain a good general-purpose thesaurus, so researchers developed techniques to automatically generate thesauri for use in query modification. Most of the automatic methods are based on analyzing word co-occurrence in the documents (which often produces a list of strongly related words). Most query augmentation techniques based on automatically generated thesauri have had very limited success in improving search effectiveness. The main reason behind this is the lack of query context in the augmentation process: not all words related to a query word are meaningful in the context of the query.

C. Relevance Feedback for Query Modification
In an IRS, the indexing step pre-processes documents and queries in order to obtain keywords (relevant words, also named terms) to be used in the query. At this point, it is important to consider the use of stemming (to reduce related words to their stem, base or root form) and stopword lists (to remove words or terms that carry little or no semantically important information during searching and indexing). Matching, as a process, involves computing the similarity between documents and queries by weighting terms. The TF-IDF and BM25 (best match) algorithms are the most frequently applied algorithms for term weighting. Based on these algorithms, most IRS return a list of ranked documents in response to a query, where the documents the system considers most similar to the query are first on the list. Once the first answer set is obtained, different query expansion techniques can be applied. For example, the most relevant keywords of the top documents previously retrieved can be added to the query in order to re-rank the documents. This process is called "relevance feedback" (RF). The retrieval can be further enhanced by modifying the words of the queries using other keywords more representative of the document content - e.g., including MeSH Headings (Rivas et al, 2014).
In 1965, Rocchio proposed using RF for query modification (Singhal, 2001). RF is motivated by the fact that it is easy for users to judge some documents as relevant or non-relevant for their query. Using such relevance judgments, a system can automatically generate a better query by adding related new terms for further searching. In general, the user is asked to judge the relevance of the top few documents retrieved by the IRS. Based on these judgments, the system modifies the query and issues the new query for finding more relevant documents from the collection. RF has been shown to work quite effectively across test collections.
The Rocchio algorithm was the RF mechanism introduced and popularized by Salton's SMART system. In a real IR query context, there exists a user query and partial knowledge of known relevant and non-relevant documents. The algorithm proposes using the modified query in Eq. (9),

\vec{q}_m = \alpha\,\vec{q}_0 + \frac{\beta}{|D_r|}\sum_{\vec{d}_j \in D_r}\vec{d}_j - \frac{\gamma}{|D_{nr}|}\sum_{\vec{d}_j \in D_{nr}}\vec{d}_j    (9)

where \vec{q}_0 is the original query vector, D_r and D_{nr} are the sets of known relevant and non-relevant documents respectively, and α, β, and γ are weights attached to each term. These control the balance between trusting the judged document set and trusting the query: if there are a lot of judged documents, higher β and γ are appropriate. Starting from \vec{q}_0, the new query moves some distance toward the centroid of the relevant documents and some distance away from the centroid of the non-relevant documents. This new query can be used for retrieval in the standard VSM. We can easily leave the positive quadrant of the vector space by subtracting off a non-relevant document's vector; in the Rocchio algorithm, negative term weights are therefore ignored, that is, the term weight is set to 0.
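A minimal sketch of the Rocchio modification in Eq. (9) follows. The term-weight vectors and the particular alpha, beta, gamma values are illustrative, not taken from the paper.

```python
# Minimal sketch of Rocchio query modification, Eq. (9), on term-weight vectors.
# The vectors and the weight values alpha, beta, gamma are illustrative.

def rocchio(q0: list[float],
            relevant: list[list[float]],
            nonrelevant: list[list[float]],
            alpha: float = 1.0, beta: float = 0.75, gamma: float = 0.15) -> list[float]:
    """Return q_m = alpha*q0 + beta*centroid(D_r) - gamma*centroid(D_nr)."""
    def centroid(docs: list[list[float]]) -> list[float]:
        if not docs:
            return [0.0] * len(q0)
        return [sum(col) / len(docs) for col in zip(*docs)]

    c_rel, c_nonrel = centroid(relevant), centroid(nonrelevant)
    q_m = [alpha * w0 + beta * wr - gamma * wn
           for w0, wr, wn in zip(q0, c_rel, c_nonrel)]
    # Negative term weights are ignored (set to 0), as described in the text.
    return [max(0.0, w) for w in q_m]

q0 = [1.0, 0.0, 1.0]                                  # original query vector
relevant_docs = [[0.8, 0.4, 0.9], [0.6, 0.2, 0.7]]    # judged relevant
nonrelevant_docs = [[0.1, 0.9, 0.0]]                  # judged non-relevant
print(rocchio(q0, relevant_docs, nonrelevant_docs))   # modified query vector
```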
RF can improve both recall and precision, but in practice it has been shown to be most useful for increasing recall in situations where recall is important. This is partly because the technique expands the query, but it is also partly an effect of the use case: when users want high recall, they can be expected to take time to review results and to iterate on the search. Positive feedback also turns out to be much more valuable than negative feedback, and so most IRS set γ < β; reasonable values might be α = 1, β = 0.75, and γ = 0.15. In fact, many IRS allow only positive feedback, which is equivalent to setting γ = 0. Another alternative is to use, as negative feedback, only the marked non-relevant documents that received the highest ranking from the IR system.
New techniques to do meaningful query expansion (QE) in the absence of any user feedback were developed in the early 1990s. The most notable of these is pseudo-feedback, a variant of relevance feedback (Buckley et al, 1995). Given that the top few documents retrieved by an IRS are often on the general query topic, selecting related terms from these documents should yield useful new terms irrespective of document relevance. In pseudo-feedback, the IRS assumes that the top few documents retrieved for the initial user query are "relevant", and performs RF to generate a new query. This expanded new query is then used to rank documents for presentation to the user. Pseudo-feedback has been shown to be a very effective technique, especially for short user queries.

D. Document Clustering
Many other techniques have been developed over the years with varying degrees of success. Document clustering is the process of grouping similar documents together to perform the task of IR quickly and efficiently. It is just one of several ways of organizing documents to facilitate retrieval from large databases. The clustering hypothesis states that documents that cluster together (are very similar to each other) will have a similar relevance profile for a given query (Griffiths and Steyvers, 2004). Document clustering techniques were (and still are) an active area of research. Though the usefulness of document clustering for improved search effectiveness (or efficiency) has been very limited, document clustering has enabled several developments in IR, e.g., for browsing and search interfaces.
During the IR and ranking process, two classes of similarity measures must be considered: (i) the similarity of a document and a query, and (ii) the similarity of two documents in a database. The similarity of two documents is important for identifying groups of documents in a database that can be retrieved and processed together for a user input query. Serizawa and Kobayashi (2013) opine that several important points should be considered in the development and implementation of algorithms for clustering documents in very large databases. These include identifying relevant attributes of documents and determining appropriate weights for each attribute; selecting an appropriate clustering method and similarity measure; estimating limitations on computational and memory resources; evaluating the reliability and speed of the retrieved results; facilitating changes or updates in the database, taking into account the rate and extent of the changes; and selecting an appropriate search algorithm for retrieval and ranking. This final point is of particularly great concern for Web-based searches. Serizawa and Kobayashi (2013) further stress that there are two main categories of clustering, hierarchical and non-hierarchical, and that hierarchical methods show greater promise for enhancing Internet search and retrieval systems. Although details of the clustering algorithms used by major search engines are not publicly available, some general approaches are known. For instance, Digital Equipment Corporation's Web search engine, AltaVista, is based on clustering. Anick (2003) explores how to combine results from latent semantic indexing and analysis of phrases for context-based information retrieval on the Web.
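A small sketch of hierarchical (agglomerative) document clustering over cosine similarity is given below. The document vectors and the merge threshold are illustrative, and the single-linkage strategy is just one of many possible clustering choices.

```python
import math

# Minimal sketch of hierarchical (agglomerative, single-linkage) document
# clustering using cosine similarity. Vectors and threshold are illustrative.

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def agglomerative_clusters(docs: dict[str, list[float]],
                           threshold: float = 0.8) -> list[set[str]]:
    """Repeatedly merge the two most similar clusters until no pair of clusters
    is more similar than the threshold."""
    clusters = [{d} for d in docs]
    while len(clusters) > 1:
        best_sim, best_pair = -1.0, (0, 1)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = max(cosine(docs[a], docs[b])
                          for a in clusters[i] for b in clusters[j])
                if sim > best_sim:
                    best_sim, best_pair = sim, (i, j)
        if best_sim < threshold:
            break
        i, j = best_pair
        clusters[i] |= clusters[j]
        del clusters[j]
    return clusters

docs = {
    "d1": [1.0, 0.9, 0.0],
    "d2": [0.9, 1.0, 0.1],
    "d3": [0.0, 0.1, 1.0],
}
print(agglomerative_clusters(docs))  # d1 and d2 merge into one cluster; d3 stays alone
```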
E. Natural Language Processing (NLP)
NLP has also been proposed as a tool to enhance retrieval effectiveness, but with very limited success (Strzalkowski et al, 1997). Although document ranking is a critical application of IR, it is definitely not the only application. The field has developed techniques to attack many different problems, such as information filtering (Belkin and Croft, 1992), topic detection and tracking (TDT) (Allan et al, 2000), speech retrieval (Sparck et al, 2000), cross-language retrieval (Grefenstette, 1998), question answering (Pasca and Harabagiu, 2001), and many more.

F. Indexing
The term "indexing", used here in the context of retrieval and ranking, has a specific meaning. Some definitions proposed by experts are: "a collection of terms with pointers to places where information about documents can be found" (Manber, 1999); building a data structure that will allow quick searching of the text (Baeza-Yates and Ribeiro-Neto, 1999); or the act of assigning index terms to documents, which are the objects to be retrieved (Korfhage, 1997). Serizawa and Kobayashi (2013) identified four approaches to indexing documents on the Web: (1) human or manual indexing; (2) automatic indexing; (3) intelligent or agent-based indexing; and (4) metadata, resource description framework (RDF), and annotation-based indexing. The first two appear in many classical texts, while the latter two are relatively new and promising areas of study. However, the development of effective indexing tools to aid in filtering is another major class of problems associated with Web-based search and retrieval.
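One common data structure for quick text searching in the sense described above is the inverted index; a minimal sketch of automatic indexing with such an index follows. The sample documents are hypothetical.

```python
from collections import defaultdict

# Minimal sketch of automatic indexing: an inverted index mapping each term to
# the set of documents that contain it. The sample documents are hypothetical.

def build_inverted_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Assign index terms (here, lowercased whitespace tokens) to documents."""
    index: dict[str, set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def lookup(index: dict[str, set[str]], *terms: str) -> set[str]:
    """Documents containing all of the given terms (Boolean AND of postings)."""
    postings = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*postings) if postings else set()

docs = {
    "d1": "information retrieval models and techniques",
    "d2": "indexing documents on the web",
    "d3": "web information filtering techniques",
}
index = build_inverted_index(docs)
print(lookup(index, "information", "techniques"))  # {'d1', 'd3'}
```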