                             International Research Journal of Advanced Engineering and Science 
                                                                                                                                ISSN (Online): 2455-9024 
            
            
                     Information Retrieval Models, Techniques and 
                                                                  Applications 
                                                   Olalere A. Abass¹, Oluremi A. Arowolo²
                             ¹,²Dept. of Computer Science, Tai Solarin College of Education, Omu-Ijebu, Ogun State, Nigeria 
                                                                                    
                                                                                    
           Abstract—  An  Information  Retrieval  (IR)  system  focuses  on  the     different  areas  of  application  of  IR  techniques.  Finally,  the 
           processing of data collection by means of representation, storage,        conclusion of the paper is drawn in Section 5.  
           and searching for the purpose of knowledge discovery in response to 
           user request via query. The tendency of the IRS to produce relevant                    II.    INFORMATION RETRIEVAL SYSTEM  
           documents with high precision and recall to meet user’s need based        A.  Framework of IRS 
           on query input depends on the adoption of the appropriate techniques          According  to  Sharma  and  Patel  (2013),  there  are  three 
           by the search engine. In this paper, we explain the concepts of IR and 
           the traditional models on which various IR techniques rely. We            basic processes an IRS has to support: (i) the representation of 
           also give a detailed description of IR techniques that have been          the content of the documents, (ii) the representation of the 
           successfully applied to store, manage and retrieve documents from         user's information need, and (iii) the comparison of the two 
           the huge amount of data available to users of IR systems. This shows that     representations. The processes are visualized in Figure 1, as 
           applications of these retrieval techniques in digital libraries,          presented by Sharma and Patel (2013). In the figure, square 
           information filtering systems, media search, search engines and           boxes represent data and rounded boxes represent processes. 
           domain-specific areas of IR are capable of increasing the throughput      Representing the documents is usually called the indexing 
           and minimizing the access time of the user with respect to information    process. 
           needs. 
                                                                                      
           Keywords— IRS framework, models, IR techniques, recall, precision. 
                                   I.    INTRODUCTION  
           Information retrieval (IR), as a subfield of computer science, 
           deals with the representation, storage, and access of 
           information and is concerned with the organization and 
           retrieval of information from large database collections 
           (Sagayam et al, 2012). In response to a user request via a query, 
           IR focuses on processing a collection of data by means of 
           representation, storage, and searching for the purpose of 
           knowledge discovery. This process involves various stages, 
           starting with representing data and ending with returning 
           relevant information to the user. Intermediate stages include 
           filtering, searching, matching and ranking operations. The 
           primary objective of an information retrieval system (IRS) is to 
           help users access relevant information corresponding to 
           their needs, or a document that satisfies those 
           needs. 
               According to [1], there are two basic measures for 
           assessing the quality of an IRS: (i) Precision, the 
           percentage of retrieved documents that are in fact relevant to 
           the query, and (ii) Recall, the percentage of documents that                                  Fig. 1. A general framework of IRS. 
           are relevant to the query and were in fact retrieved. The 
           tendency of the IRS to yield a list of relevant documents with                The process takes place off-line, that is, the end user of the 
           high precision and recall to meet the user’s need as specified by         IRS is not directly involved. The indexing process results in a 
           the query depends on the use of the appropriate techniques by             representation of the document. The process of representing the 
           the search engine. This remains the focus of the paper as we              user’s information need is often referred to as the query 
           attempt to fully explain the different IR techniques used so far by       formulation process, and the resulting representation is the query 
           various researchers and the developers of the IRS.                        (Hiemstra, 2009). Comparing the two representations is 
               The structure of this paper is as follows. A brief review of          known as the matching process. Retrieval of documents is the 
           IRS  framework  and  IR  models  are  presented  in  Section  2,          result of this process. 
           followed by IR techniques in Section 3. Section 4 deals with 
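The two quality measures defined in the Introduction, precision and recall, can be illustrated with a minimal Python sketch. The function name `precision_recall` and the document identifiers are hypothetical and chosen only for illustration; the measures are returned as fractions rather than percentages.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    retrieved: ids of the documents the system returned
    relevant:  ids of the documents that are actually relevant
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # retrieved documents that are relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical result list: 4 documents retrieved, 3 relevant overall, 2 in common.
retrieved_ids = ["d1", "d2", "d3", "d4"]
relevant_ids = ["d2", "d4", "d7"]
precision, recall = precision_recall(retrieved_ids, relevant_ids)
print(f"precision={precision:.2f}, recall={recall:.2f}")  # precision=0.50, recall=0.67
```

Here half of the retrieved documents are relevant (precision 0.50), while two of the three relevant documents were found (recall 0.67).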
             Olalere A. Abass and Oluremi A. Arowolo, “Information retrieval models, techniques and applications,” International Research Journal of 
                                                                                        
             Advanced Engineering and Science, Volume 2, Issue 2, pp. 197-202, 2017.
              B.  Information Retrieval Models                                                                       most important characteristic of the probabilistic model is 
                    Mathematical models are used in many scientific areas with                                       its attempt to rank documents by their probability of relevance 
              the objective to understand and reason about some behaviour or                                         given a query (Robertson and Jones, 1976). Documents and 
              phenomenon in the real world. A model of IR predicts and                                     queries are represented by binary vectors $\vec{d}$ and $\vec{q}$, each 
              explains what a user will find relevant to a given query. The                                          vector element indicating whether or not a document attribute or 
              correctness of the model’s predictions can be tested in a                                     term occurs in the document or query. Instead of 
              controlled experiment. Hence, a model of IR serves as a                                     probabilities, the probabilistic model uses odds O(R), where 
              blueprint which is used to implement an actual IRS (Hiemstra,                                          $O(R) = P(R)/(1 - P(R))$, $R$ means “document is relevant” 
              2009).                                                                                                 and $\bar{R}$ means “document is not relevant” (Hiemstra et al, 2000). 
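As a small worked illustration of the odds used by the probabilistic model, the sketch below converts a probability of relevance into odds and checks that ranking by odds gives the same order as ranking by probability. The function name `odds` and the probability values are assumptions made for the example, not estimates from any real collection.

```python
def odds(p_relevant):
    """Odds of relevance, O(R) = P(R) / (1 - P(R))."""
    if not 0.0 <= p_relevant < 1.0:
        raise ValueError("probability must be in [0, 1)")
    return p_relevant / (1.0 - p_relevant)


# Hypothetical probabilities of relevance for three documents.
probs = {"d1": 0.75, "d2": 0.5, "d3": 0.2}

# Odds grow monotonically with probability, so both rankings agree.
by_prob = sorted(probs, key=probs.get, reverse=True)
by_odds = sorted(probs, key=lambda d: odds(probs[d]), reverse=True)

print(odds(0.75), odds(0.5))  # 3.0 1.0
print(by_prob == by_odds)     # True
```

Because the transformation is monotonic, replacing probabilities by odds does not change the ranking that the probabilistic ranking principle asks for.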
              C.  The Traditional or Classical Models                                                                              III.     INFORMATION RETRIEVAL TECHNIQUES  
                    The three most used models in IR research are the vector                                         A.  Term Weighting 
              space,  the  probabilistic  model,  and  the  inference  network                                            Weighting  methods  developed  under  the  probabilistic 
              models (Singhal, 2001). These three models are regarded as the                                         models  rely  heavily  upon  better  estimation  of  various 
              traditional retrieval models.                                                                          probabilities (Singhal, 2001). Term weighting is a technique 
              i.     Boolean model (BM) - A measure of exact match                                                   of obtaining the most critical piece of information needed for 
                    This model provides exact matching, i.e. documents are                                           document  ranking  in  all  IR  models.  Various  methods  for 
              either  retrieved  or  not,  but  the  retrieved  documents  are  not                                  weighting terms have been developed in the field. Weighting 
              ranked. The retrieval function in this model treats a document                                         methods developed under the probabilistic models rely heavily 
              as either relevant or irrelevant (Alhenshiri, 2003). That is, in                                       upon better estimation of various probabilities (Robertson and 
              BM, the retrieved documents are adjudged as either                                              Jones, 1976). Methods developed under the VSM are often 
              “relevant” or “not relevant”. 
              ii.    Vector space model (VSM) - A measure of document                                        based on researchers’ experience with systems and large-scale 
              similarity to the query by ranking                                                                     experimentation (Salton and Buckley, 1988). In both models, 
                    The VSM can best be characterized by its attempt to rank                                         three main factors come into play in the final term weight 
              documents by the similarity between the query and each                                            formulation (Singhal, 2001):   
              document (Salton and McGill, 1986). In the VSM,                                                   i.     Term Frequency (or tf) 
              documents and queries are represented as vectors, and the                                                     Words that repeat multiple times in a document are 
              angle between the two vectors is computed using the cosine                                         considered salient. Term weights based on tf have been used in 
              similarity function. The cosine similarity function can be defined as                                          the VSM since the 1960s. TF measures how relevant a 
              (Sharma and Patel, 2013):                                                                              particular document d is to a given term t. One way of 
                                                                                                                     measuring TF(d,t), the relevance of a document d to term t, is: 
                                                                                                          (1)                                                                                                    (4) 
              Documents and queries are represented as vectors.                                                      where n(d) denotes the number of terms in the document and 
                                                                                                          (2)        n(d,t)  denotes  the  number  of  occurrences  of  term  t  in  the 
                                                                                                          (3)        document d. 
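The contrast between exact-match Boolean retrieval and similarity ranking in the VSM can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the toy collection, the function names, and the use of raw term counts as vector weights are all hypothetical, and the cosine is taken in its standard form, cos(d, q) = (d · q) / (|d| |q|).

```python
import math
from collections import Counter

# Hypothetical toy collection (document id -> text).
docs = {
    "d1": "information retrieval ranks documents",
    "d2": "boolean retrieval returns unranked documents",
    "d3": "vector space model for information retrieval",
}


def boolean_and(query, docs):
    """Exact match: return ids of documents containing every query term (unranked)."""
    terms = set(query.split())
    return sorted(d for d, text in docs.items() if terms <= set(text.split()))


def cosine(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(query, docs):
    """VSM: return (doc id, score) pairs sorted by decreasing cosine similarity."""
    q = Counter(query.split())
    scored = [(d, cosine(q, Counter(text.split()))) for d, text in docs.items()]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


print(boolean_and("information retrieval", docs))          # ['d1', 'd3']
print([d for d, _ in rank("information retrieval", docs)])  # ['d1', 'd3', 'd2']
```

The Boolean model returns d1 and d3 with no order and drops d2 entirely, whereas the VSM orders all three documents, placing d2 last because it shares only one term with the query.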
                    The VSM introduced a term-weighting scheme known as tf-idf                                          ii.    Document Frequency 
              weighting. These weights have a term frequency (tf) factor                                                  Words that appear in many documents are considered 
              measuring the frequency of occurrence of the terms in the                                              common and are not very indicative of document content. A 
              document or query texts and an inverse document frequency                                              weighting method based on this, called inverse document 
              (idf) factor measuring the inverse of the number of documents                                          frequency (or idf) weighting, was proposed by Sparck-Jones 
              that contain a query or document term (Samar et al, 2016). The                                         in the early 1970s. When a query contains two or more keywords, 
              VSM of IR is a very successful statistical method proposed by                                          the relevance of a document to the query 
              Salton and Buckley (1988).                                                                             is estimated by combining the relevance measures of 
                    A major achievement of the researchers that developed the                                        the document to each word. A simple way to combine the 
              VSM is the introduction of this family of tf.idf term weights.                                         measures is to add them up. However, not all terms used as 
                                                                                                                     keywords are equally important. To fix this problem, weights are assigned to 
                                                                                                                     terms using the inverse document frequency (IDF) defined as: 
                                                                                                                                                                                                                 (5) 
                                                                                                                     where n(t) denotes the number of documents (among those 
                                                                                                                     indexed by the system) that contain the term t.  
              iii.  The probabilistic model – A measure of probability                                                    The relevance of a document d to a set of terms Q is 
                     of relevance                                                                                    then defined as: 
                    This family of IR models is based on the general principle                                                                                                                                   (6) 
              that documents in a collection should be ranked by decreasing                                               The weight of an index term is proportional to its 
              probability of their relevance to a query. This is often called the                                    frequency in a document (term frequency or tf factor), and 
              probabilistic ranking principle (PRP) (Robertson, 1990). The                                            inversely proportional to its frequency among all documents in 
               the system (inverse document frequency or idf factor). This                                               occurrence in the documents (which often produces a list of 
               measure can be further refined if the user is permitted to specify                                         strongly related words). Most query augmentation techniques 
               weights w(t) for terms in the query, in which case the user-                                    based on automatically generated thesauri had very limited 
               specified weights are also taken into account by multiplying                                       success in improving search effectiveness. The main reason 
               TF(d,t) by w(t) in the above formula. This approach of using term                                            behind this is the lack of query context in the augmentation 
               frequency and inverse document frequency is called the TF-IDF                                           process. Not all words related to a query word are meaningful 
               approach.                                                                                                  in the context of the query.   
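The TF-IDF relevance computation described above can be sketched as follows. The exact forms used here are assumptions chosen to be consistent with the definitions of n(d), n(d,t) and n(t) given in the text: TF(d,t) = log(1 + n(d,t)/n(d)), IDF(t) = 1/n(t), and r(d,Q) as the sum over query terms of w(t) · TF(d,t) · IDF(t). The function names and the toy corpus are hypothetical.

```python
import math


def tf(doc_terms, term):
    """TF(d,t) = log(1 + n(d,t)/n(d)): assumed form, with n(d) the document length in terms."""
    if not doc_terms:
        return 0.0
    return math.log(1 + doc_terms.count(term) / len(doc_terms))


def idf(corpus, term):
    """IDF(t) = 1/n(t), with n(t) the number of documents that contain t (assumed form)."""
    n_t = sum(1 for doc in corpus if term in doc)
    return 1.0 / n_t if n_t else 0.0


def relevance(doc_terms, query_terms, corpus, weights=None):
    """r(d,Q): sum over query terms of w(t) * TF(d,t) * IDF(t); w(t) defaults to 1."""
    weights = weights or {}
    return sum(
        weights.get(t, 1.0) * tf(doc_terms, t) * idf(corpus, t) for t in query_terms
    )


# Hypothetical corpus of three tokenized documents.
corpus = [
    ["retrieval", "model", "retrieval"],
    ["retrieval", "system"],
    ["query", "model"],
]
query = ["retrieval", "query"]

scores = [relevance(doc, query, corpus) for doc in corpus]
best = max(range(len(corpus)), key=scores.__getitem__)
print(best)  # 2: the rare term "query" outweighs the common term "retrieval"

# User-specified weight w(t) = 2 for "retrieval" doubles its contribution.
boosted = relevance(corpus[0], query, corpus, weights={"retrieval": 2.0})
```

The third document wins even though it matches only one query term, because that term occurs in a single document and therefore carries a larger IDF than the term shared by two documents.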
                     It is important that the assignment of weights to every                                     C.  Relevance Feedback for Query Modification 
               index term (called “term weighting”) is automatic. The so-                                              In an IRS, the indexing step pre-processes documents and 
               called TF-IDF method is mainly used to determine the weight                                                 queries in order to obtain keywords (relevant words, also 
               of a term; TF is the frequency of occurrence of a term in a                                                called terms) to be used in the query. At this point, it is 
               document and IDF varies inversely with the number of                                               important to consider the use of stopword lists (to remove 
               documents to which the term is assigned (Ropero et al, 2012).                                                words or terms that carry little or no semantically 
               iii.   Document Length                                                                                     important information during searching and indexing) 
                     This is the third factor in term weighting. When collections                                         and stemming, which reduces related words to their stem, 
               have documents of varying lengths, longer documents tend to                                                base or root form.  
               score  higher  since  they  contain  more  words  due  to  word                                                 Matching,  as  a  process,  involves  computation  of  the 
               repetitions. This effect is usually compensated by normalizing                                             similarity between documents and queries by weighting terms. 
               for document lengths in the term weighting method. Before                                          The TF-IDF and BM25 (best match) algorithms are the 
               TREC (Text Retrieval Conference), both the VSM and the                                             most frequently applied algorithms for term weighting. Based on the 
               probabilistic models developed term weighting schemes which                                                use of these algorithms, most IRS return a ranked list of 
               were shown to be effective on the small test collections available                                         documents in response to a query, where the documents the 
               then. The inception of TREC provided IR researchers with very large                                            system considers most similar to the query come first on the 
               and varied test collections allowing rapid development of                                           list. Once the first answer set is obtained, different query 
               effective  weighting  schemes.  The  state-of-the-art  scoring                                             expansion techniques can be applied. For example, the most 
               technique  that  combines  the  above  three  factors  is  called                                          relevant keywords of the top documents previously retrieved 
               Okapi  weighting  based  document  score  (Robertson  et  al,                                              can be added to the query in order to re-rank the documents. 
               1999) as shown in the Eq. 7 below.                                                                         This process is called “relevance feedback” (RF). The retrieval 
               $$\sum_{t \in Q,D} \ln\frac{N - df + 0.5}{df + 0.5} \cdot \frac{(k_1 + 1)\,tf}{k_1\left((1 - b) + b\,\frac{dl}{avdl}\right) + tf} \cdot \frac{(k_3 + 1)\,qtf}{k_3 + qtf} \qquad (7)$$
                                                                                                                          can be further enhanced by modifying the words of the queries 
                                                                                                                          using other keywords more representative of the document 
                                                                                                                          content - e.g., including MeSH Headings (Rivas et al, 2014). 
               tf            is the term’s frequency in document                                                          In 1965, Rocchio proposed using RF for query modification 
               qtf           is the term’s frequency in query                                                             (Singhal, 2001). RF is motivated by the fact that it is easy for 
               N             is the total number of documents in the collection                                           users to judge some documents as relevant or non-relevant for 
               df            is the number of documents that contain the term                                             their query. Using such relevance judgments, a system can then 
               dl            is the document length (in bytes), and                                                       automatically  generate  a  better  query  by  adding  related  new 
               avdl          is the average document length                                                               terms for further searching. In general, the user is asked to judge 
               $k_1$ (between 1.0 and 2.0), $b$ (usually 0.75) and $k_3$ (between 0 and 1000) are                         the relevance of the top few documents retrieved by the IRS. 
               constants.                                                                                                      Based on these judgments, the system modifies the query 
                     According  to  Singhal  (2001),  the  pivoted  normalization                                         and issues the new query for finding more relevant documents 
               weighting based document score is                                                                          from  the  collection.  RF  has  been  shown  to  work  quite 
               $$\sum_{t \in Q,D} \frac{1 + \ln(1 + \ln(tf))}{(1 - s) + s\,\frac{dl}{avdl}} \cdot qtf \cdot \ln\frac{N + 1}{df} \qquad (8)$$
                                                                                                                          effectively across test collections.  
                                                                                                                               The Rocchio algorithm was the RF mechanism introduced and 
                                                                                                                          popularized by Salton’s SMART system. In a real IR query 
               where s is a constant (usually 0.20).                                                                      context,  there  exists  a  user  query  and  partial  knowledge  of 
                                                                                                                          known relevant and non-relevant documents.  
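The two document scores of Eq. 7 (Okapi weighting) and Eq. 8 (pivoted normalization weighting) can be written as a short Python sketch. The function names, argument layout and sample statistics are hypothetical; the default constants are picked from the ranges quoted in the text (k1 = 1.2, b = 0.75, k3 = 1000, s = 0.20). Only the ratio dl/avdl enters the formulas, so any consistent unit of length can be used.

```python
import math


def okapi_score(query_tf, doc_tf, dl, avdl, N, df, k1=1.2, b=0.75, k3=1000.0):
    """Okapi weighting based document score (Eq. 7), summed over terms in both Q and D."""
    score = 0.0
    for term, qtf in query_tf.items():
        tf = doc_tf.get(term, 0)
        if tf == 0 or df.get(term, 0) == 0:
            continue  # the sum runs only over terms present in query and document
        idf_part = math.log((N - df[term] + 0.5) / (df[term] + 0.5))
        tf_part = ((k1 + 1) * tf) / (k1 * ((1 - b) + b * dl / avdl) + tf)
        qtf_part = ((k3 + 1) * qtf) / (k3 + qtf)
        score += idf_part * tf_part * qtf_part
    return score


def pivoted_score(query_tf, doc_tf, dl, avdl, N, df, s=0.20):
    """Pivoted normalization weighting based document score (Eq. 8)."""
    score = 0.0
    for term, qtf in query_tf.items():
        tf = doc_tf.get(term, 0)
        if tf == 0 or df.get(term, 0) == 0:
            continue
        tf_part = (1 + math.log(1 + math.log(tf))) / ((1 - s) + s * dl / avdl)
        score += tf_part * qtf * math.log((N + 1) / df[term])
    return score


# Hypothetical statistics: 1000 documents, the term "ir" occurs in 10 of them.
stats = dict(N=1000, df={"ir": 10}, avdl=100)
query_tf = {"ir": 1}
doc_tf = {"ir": 2}

okapi_short = okapi_score(query_tf, doc_tf, dl=50, **stats)
okapi_long = okapi_score(query_tf, doc_tf, dl=200, **stats)
pivoted_short = pivoted_score(query_tf, doc_tf, dl=50, **stats)
pivoted_long = pivoted_score(query_tf, doc_tf, dl=200, **stats)
print(okapi_short > okapi_long, pivoted_short > pivoted_long)  # True True
```

With the same term frequency, the shorter document scores higher under both schemes, which is the document-length compensation described as the third factor in term weighting.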
               B.  Query Modification using Synonyms 
                     In the early years of IR, researchers realized that it was quite 
               hard  for  users  to  formulate  effective  search  requests.  It  was 
               thought  that  adding  synonyms  of  query  words  to  the  query 
               should improve search effectiveness. Early research in IR relied 
               on a thesaurus to find synonyms (Singhal, 2001).  
                     However, it is quite expensive to obtain a good general-                                         The algorithm proposes using the modified query in Eq. 9 
               purpose thesaurus. Researchers then developed techniques to                                          where $q_0$ is the original query vector, $D_r$ and $D_{nr}$ are the sets of 
               automatically generate thesauri for use in query modification.                                        known relevant and non-relevant documents respectively, and 
               Most of the automatic methods are based on analyzing word co-                                              α, β, and γ are weights attached to each term. These control 
           the balance between trusting the judged document set versus               user input query. Serizawa and Kobayashi (2013) note that 
           the query: if there are many judged documents, higher β and               several important points should be considered in the 
           γ values are appropriate.                                                 development and implementation of algorithms for 
               Starting  from  q0,  the  new  query  moves  the  user  some          clustering documents in very large databases. These include 
           distance toward the centroid of the relevant documents and                identifying     relevant     attributes     of   documents       and 
           some  distance  away  from  the  centroid  of  the  non-relevant          determining appropriate weights for each attribute; selecting 
           documents. This new query can be used for retrieval in the                an  appropriate  clustering  method  and  similarity  measure; 
           standard VSM. We can easily leave the positive quadrant of                estimating  limitations  on  computational  and  memory 
           the vector space by subtracting off a non-relevant document’s             resources;  evaluating  the  reliability  and  speed  of  the 
           vector.                                                                   retrieved  results;  facilitating  changes  or  updates  in  the 
               In  the  Rocchio  algorithm,  negative  term  weights  are            database,  taking  into  account  the  rate  and  extent  of  the 
           ignored. That is, the term weight is set to 0. RF can improve             changes; and selecting an appropriate search algorithm for 
           both recall and precision. But, in practice, it has been shown to         retrieval and ranking. This final point is of particular 
           be most useful for increasing recall in situations where recall           great concern for Web-based searches. Serizawa and 
           is important. This is partly because the technique expands the            Kobayashi (2013) further stress that there are two 
           query, but it is also partly an effect of the use case: when they         main  categories  of  clustering:    hierarchical  and  non-
           want high recall, users can be expected to take time to review            hierarchical. Hierarchical methods show greater promise for 
           results  and  to  iterate  on  the  search.  Positive  feedback  also     enhancing Internet search  and retrieval systems.  Although 
           turns out to be much more valuable than negative feedback,                details  of  clustering  algorithms  used  by  major  search 
           and so most IRS set γ < β. Reasonable values might be α = 1, β            engines  are  not  publicly  available,  some  general 
           = 0.75, and γ = 0.15. In fact, many IRS allow only positive               approaches  are  known.  For  instance,  Digital  Equipment 
           feedback,  which  is  equivalent  to  setting  γ  =  0.  Another          Corporation’s  Web  search  engine,  AltaVista,  is  based  on 
           alternative is to use only the marked non-relevant documents              clustering.  Anick  (2003)  explore  how  to  combine  results 
           which  received  the  highest  ranking  from  the  IR  system  as         from  latent  semantic  indexing  and  analysis  of  phrases  for 
           negative feedback.                                                        context-based information retrieval on the Web. 
               New techniques to do meaningful QE in absence of any user             E  Natural Language Processing (NLP)  
           feedback were developed early 1990s. Most notable of these is                 NLP has also been proposed as a tool to enhance retrieval 
           pseudo-feedback, a variant of relevance feedback (Buckley et              effectiveness but with very limited success (Strzalkowski et al, 
           al, 1995). Given that the top few documents retrieved by an               1997).  Despite that document ranking is a critical application for 
           IRS are often on the general query topic, selecting related terms         IR,  it  is  definitely  not  the  only  application.  The  field  has 
           from  these  documents  should  yield  useful  new  terms                 developed  techniques  to  attack  many  different  problems  like 
           irrespective  of  document  relevance.  In  pseudo-feedback,  the         information  filtering  (Belkin  and  Croft  (1992),  topic  detection 
           IRS assumes that the top few documents retrieved based on the             and  tracking  (or  TDT)  (Allan  et  al,  2000),  speech  retrieval 
           initial user query are “relevant”, and does RF to generate a new          ((Sparck  et  al,  2000),  cross-language  retrieval  (Grefenstette, 
           query. This expanded new query is then used to rank documents             1998),  question  answering  (Pasca  and  Harabagiu,  2001),  and 
           for presentation to the user. Pseudo feedback has been shown to           many more.   
           be a very effective technique, especially for short user queries.  
           D.  Document Clustering                                                   F.  Indexing 
               Many other techniques have been developed over the years                  The  term  “indexing”  is  used  in  the  same  spirit  in  the 
           with varying degree of success.  This is a process of grouping            context  of  retrieval  and  ranking  has  a  specific  meaning. 
           similar  documents together to perform the task of IR fast and            Some definitions proposed by experts are “a collection of 
           efficiently.  It  is  just  one  of  several  ways  of  organizing        terms  with  pointers  to  places  where  information  about 
           documents  to  facilitate  retrieval  from  large  databases.             documents  can  be  found”  (Manber.  1999).  Indexing  is 
           Clustering hypothesis states that documents that cluster together         building a data structure that will allow quick searching 
           (are very similar to each other) will have a similar relevance            of the text (Baeza-Yates and Ribeiro-Neto, 1999) or the act 
           profile  for  a  given  query  (Griffiths  and  Steyyers,  2004).         of  assigning  index  terms  to  documents,  which  are  the 
           Document clustering techniques were (and still are) an active             objects to be retrieved  (Korfhage, 1997). Serizawa and 
           area of research. Though the usefulness of document clustering            Kobayashi  (2013)  identified  four  approaches  to  indexing 
           for improved search effectiveness (or efficiency) has been very           documents on the Web which are (1) human or manual 
           limited, document clustering has allowed several developments             indexing;  (2)  automatic  indexing;  (3)  intelligent  or  agent-
           in IR, e.g., for browsing and search interfaces.                          based  indexing;  and  (4)  metadata,  resource  description 
               During  the  IR  and  ranking  process,  two  classes  of             framework  (RDF),  and  annotation-based  indexing.  The 
           similarity measures must be considered: (i) the similarity of             first two appear in many classical texts, while the latter 
           a  document  and  a  query;  and  (ii)  the  similarity  of  two          two  are  relatively  new  and  promising  areas  of  study. 
           documents in a database. The similarity of two documents                  However, the development of effective indexing tools to aid 
           is  important  for  identifying  groups  of  documents  in  a             in  filtering  is  another  major  class  of  problems  associated 
           database that can be retrieved and processed together for a               with Web-based search and retrieval. 
            
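    The Rocchio feedback step discussed on this page can be sketched in Python as follows. This is a hedged illustration, assuming the commonly stated form of the update, in which the new query is α times the original query, plus β times the centroid of the relevant documents, minus γ times the centroid of the non-relevant documents. The function name rocchio, the dictionary-based vectors, and the sample term weights are hypothetical; the default values α = 1, β = 0.75, and γ = 0.15 and the clipping of negative weights to 0 follow the text.

```python
def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return a modified query vector as a {term: weight} dict."""
    terms = set(query)
    for doc in relevant + non_relevant:
        terms.update(doc)

    def centroid_weight(docs, term):
        # Average weight of `term` over a document set (0 if the set is empty).
        if not docs:
            return 0.0
        return sum(doc.get(term, 0.0) for doc in docs) / len(docs)

    new_query = {}
    for term in terms:
        weight = (alpha * query.get(term, 0.0)
                  + beta * centroid_weight(relevant, term)
                  - gamma * centroid_weight(non_relevant, term))
        # Negative term weights are ignored, i.e. set to 0.
        new_query[term] = max(weight, 0.0)
    return new_query


# Hypothetical term-weight vectors for one query and three judged documents.
query = {"jaguar": 1.0}
relevant = [{"jaguar": 1.0, "cat": 1.0}, {"cat": 1.0, "wildlife": 1.0}]
non_relevant = [{"jaguar": 1.0, "car": 1.0}]

expanded = rocchio(query, relevant, non_relevant)
for term in sorted(expanded):
    print(term, round(expanded[term], 3))
```

    In this example the query gains the terms "cat" and "wildlife" from the relevant documents, while "car", which occurs only in the non-relevant document, would receive a negative weight and is therefore set to 0. Positive-only feedback corresponds to calling the function with gamma=0, and pseudo-feedback can be approximated by passing the top few retrieved documents as relevant with an empty non_relevant list.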
  Olalere A. Abass and Oluremi A. Arowolo, “Information retrieval models, techniques and applications,” International Research Journal of Advanced Engineering and Science, Volume 2, Issue 2, pp. 197-202, 2017.