mySAP Workplace Offers Intelligent Classification and Search Methods

by Karsten Hohage | SAPinsider

April 1, 2001

by Karsten Hohage, SAP SAPinsider - 2001 (Volume 2), April (Issue 2)
 

In this Information Age, with its often overwhelming access to content, making information retrieval manageable is a top business priority. With mySAP Workplace, SAP provides personalization and role-based preconfiguration of an easy-to-navigate structure as one way to make information more accessible.

     With upcoming releases, the Workplace will also include an intelligent search engine.¹ Users can search in-house SAP and non-SAP systems, as well as the Web, for all sorts of unstructured content.² Regardless of where the target information resides, users can search for it by file name, terms, phrases, subject, or a person's name.

     The highly sophisticated search engine of mySAP Workplace will offer:

  • Automated classification of documents from a variety of different sources (internal SAP systems, non-SAP systems, and the Web)
  • The ranking of each document it finds, plus the reason the document has been retrieved
  • Hints for accessing similar topics
  • The option to be notified about new documents that fall into specified search areas

     This article examines both the technical background of SAP's search and classification methodology for unstructured content, and the functionality users can expect from these new developments. Figure 1 highlights the relationship between SAP's search and retrieval functionality and the architecture of mySAP Workplace.

Figure 1 SAP's Search and Retrieval Functionality in the mySAP Workplace Architecture

Two Approaches to Search Capabilities

There are two basic approaches to search engine capabilities:

  • The traditional Boolean text approach
  • The vector-based classification method, which is the basis for advanced retrieval mechanisms

Boolean Search Engines with Inverted Index

The traditional approach to indexing and searching text supports Boolean expressions (AND, OR, etc.) and uses a structure called the inverted index. The inverted index records the location of each word in a database and associates each term found with a list of the documents that contain it. The term list is called the index. The list of documents associated with each index entry is called the occurrence list (or posting list).

     When a user enters the search term "president," for example, the inverted index delivers the occurrence list for the word "president." The inverted index is usually maintained as an automatically generated dictionary, with a list of "occurrence pointers" (which, of course, point the search to occurrence lists) associated with each word.

     The search can also include constraints in the form of Boolean expressions connecting several search terms.

Example: If you are looking for information on US presidential elections, but not election results in Florida, you would enter "US AND president AND election NOT Florida." The search process looks up these terms in the index and retrieves from the occurrence list only the documents that satisfy these constraints.
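To make the lookup concrete, here is a minimal Python sketch of an inverted index and of the Boolean query from the example. It is an illustration only, not SAP's implementation: the three sample documents and the function names build_inverted_index and boolean_search are assumptions made for this sketch.

```python
from collections import defaultdict

# Hypothetical mini-collection; the document IDs and texts are invented for illustration.
DOCS = {
    1: "US president wins election after Florida recount",
    2: "US president election campaign opens in Ohio",
    3: "French president visits Florida",
}

def build_inverted_index(docs):
    """Map each lowercased word to the set of document IDs containing it (its occurrence list)."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def boolean_search(index, all_of, none_of=()):
    """Return the documents that contain every term in all_of and no term in none_of."""
    if not all_of:
        return set()
    # AND: intersect the occurrence lists of all required terms.
    result = set.intersection(*(index.get(term.lower(), set()) for term in all_of))
    # NOT: remove every document that appears in an excluded term's occurrence list.
    for term in none_of:
        result -= index.get(term.lower(), set())
    return result

index = build_inverted_index(DOCS)
hits = boolean_search(index, all_of=["US", "president", "election"], none_of=["Florida"])
print(sorted(hits))  # [2]
```

Only document 2 satisfies all four constraints. Note that the result is an unordered set: nothing in the index says which hit is the better one.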

     The advantage of this traditional method is that it allows you to issue exact queries as complex Boolean expressions. The disadvantage is that this method doesn't directly support advanced functions, such as ranking documents by significance or subsequent searches for similar documents.

The Vector-Based Approach: The Foundation for SAP's New Search Functionality

SAP's new search and classification functionality uses both the Boolean and vector-based approaches. In the vector-based method, the following basic concepts apply:

The documents in a repository (local file server, Web server, Lotus database, etc.) and the words occurring in these documents form an n × m matrix, with one row for each of the n documents and one column for each of the m words. In the simplest case, documents are indexed according to the x most significant words that occur within them when they are checked in. This results in a document vector for each document: one row of the matrix, whose dimensions are defined by all the words involved.

     The features of the resulting vectors are words or combinations of words in a document (content or document vector) or query (search or query vector). The vectors are weighted to give emphasis to features that exemplify meaning and are useful in retrieval. When a query is issued by a user, the query vector is compared to each document vector. Those that are closest to the query are considered to be similar, and are returned in the search result.

     The vector-based approach allows the search tool to assign a true rank to the documents it retrieves. The rank indicates the level of match to your search query. Ranking the results in this approach also assists in the automatic identification of document content classes (manually or automatically defined groups of documents with similar content), as highlighted in Figure 2.
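The comparison of a query vector with document vectors can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: raw word counts serve as weights, closeness is measured with cosine similarity (one common choice; the article does not name the measure SAP uses), and the sample documents are invented.

```python
import math
from collections import Counter

def to_vector(text):
    """Represent a text as a sparse word-count vector (raw counts as weights, for simplicity)."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors; 0.0 means no shared words."""
    dot = sum(weight * b[word] for word, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank(query, docs):
    """Return (score, name) pairs, best match first."""
    q = to_vector(query)
    scored = [(cosine_similarity(q, to_vector(text)), name) for name, text in docs.items()]
    return sorted(scored, reverse=True)

# Hypothetical documents, invented for illustration.
docs = {
    "a": "president election results",
    "b": "election results in florida counties",
    "c": "peace talks resume",
}
ranking = rank("president election", docs)
print([name for score, name in ranking])  # ['a', 'b', 'c']
```

Document "a" shares both query words and ranks first, "b" shares one, and "c" shares none and receives a score of zero.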

Calculating the Significance of Search Terms

One formula for computing the significance of a particular search term is TF x IDF. Simply put, this formula multiplies the term frequency, or "TF" (i.e., the number of times a word appears in the document), by the inverse document frequency, or "IDF," of the word. (In this simplified form, IDF is equal to 1 divided by the number of documents in the collection that contain the word; in practice, a logarithmically scaled variant is commonly used.)

The IDF factor is helpful in determining the content-discriminating power of a particular search term.

     For example, a term that appears rarely (e.g., "pancreas") has a high IDF, while a term that occurs in many documents (e.g., "is") has a low IDF. Using this formula, the search engine will identify "pancreas" as a much better indicator of a match than the word "is."
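The following Python sketch works through the "pancreas" versus "is" comparison. It assumes the simplest form of the formula, with IDF taken as 1 divided by the number of documents that contain the word; the four sample sentences and the function name tf_idf are invented for illustration.

```python
def tf_idf(term, document, collection):
    """Simplified TF x IDF: occurrences in the document, divided by the
    number of documents in the collection that contain the term."""
    term = term.lower()
    tf = document.lower().split().count(term)
    df = sum(1 for text in collection if term in text.lower().split())
    return tf / df if df else 0.0

# Hypothetical collection, invented for illustration.
collection = [
    "the pancreas is a gland and the pancreas makes insulin",
    "the liver is an organ",
    "the heart is a muscle",
    "the skin is an organ",
]
first = collection[0]
print(tf_idf("pancreas", first, collection))  # 2.0
print(tf_idf("is", first, collection))        # 0.25
```

"Pancreas" occurs twice in the first document and in no other, so its weight is 2.0. "Is" occurs in all four documents, which pushes its weight down to 0.25.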



Figure 2 Automatic Classification Categorizes New Documents by Content and Assigns Them to Content Classes

     Most true ranking algorithms for Web resources use similarity measures based on vector models. To compute the similarities, one can think of each document as a single vector with one entry per word. Words in these vectors are weighted by significance. (See sidebar, "Calculating the Significance of Search Terms.")

Columns: Functionality | SAP Knowledge Warehouse 5.1 | SAP Workplace

Rows:

  • Exact Search
  • Fuzzy Search
  • Attribute-Based Search
  • Ranking of Documents
  • Feature Extraction
  • "See Also" Option
  • Search for Similar/Related Terms
  • Integrate Web and Internal Information
  • Crawl Different Repositories
  • Classify Documents Automatically
 
Figure 3 Search and Classification Functionality Available in SAP Knowledge Warehouse and in mySAP Workplace

SAP's Search and Classification Based on Vector Models

SAP's methods for search and classification can be used in the future by any mySAP solution or non-SAP application via XML-APIs. Indexing and searching are not limited to SAP repositories, but can be extended to the Web or third-party databases and file servers.

     mySAP Workplace and its Web content management components will be among the first to offer the extensive search and classification functionality listed in Figure 3.

Start with the Basic Search

A search can start from various types of user interfaces, including mySAP Workplace, any connected mySAP application, or even an independent interface designed by customers and their consultants.

     Some of the search features described in the following sections are already part of SAP Knowledge Warehouse 5.1, while the others will be added by SAP Workplace soon. (Figure 3 specifies which solutions offer the search and classification functionality discussed in this article.)

     A search starts in the way familiar to most users - by typing in a search term, and perhaps specifying Boolean operators such as AND, OR, etc. Then, using vector-based methods, the system displays the result as a ranked list of documents, and includes a summary of the differentiating features of each document.

"Exact" or "Fuzzy" Search

The user or the administrator can set searches to be conducted as "exact" or "fuzzy" with a single click.

     In an exact search, the user's term leads to a search only for an exact match. In a fuzzy search, the search engine gives the user some flexibility. It takes segments of the search terms of a user's query and compares them to similarly segmented index entries. If there is a high total similarity between the segments, the search engine assumes a hit.

     This is especially helpful when there is no exact match for the user's search term.

Example: A user types in a search for "presidant" (misspelling the word "president"). In a fuzzy search, assuming the search engine divides the nine-letter word into three segments of three letters each ("pre," "sid," and "ant"), the first and second segments return a 100% match against "president." The third segment ("ant" versus "ent") matches in two of its three letters, a 67% match. The averaged similarity is 89%, and the search for "presidant" returns documents that include the word "president" - documents that would not have been retrieved in an exact search.

     Of course, this is a highly simplified example. The actual algorithm for the fuzzy search works with a similar but more complex method of segmented comparison and similarity calculation.
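The simplified example can be reproduced with a short Python sketch. It is only a toy version of segmented comparison, written under the assumption that each word is cut into three contiguous segments and that segments are compared letter by letter; the function names are invented and the actual algorithm is, as noted, more complex.

```python
def split_segments(word, parts=3):
    """Split a word into `parts` contiguous segments of near-equal length."""
    size, extra = divmod(len(word), parts)
    segments, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        segments.append(word[start:end])
        start = end
    return segments

def segment_similarity(a, b):
    """Fraction of positions at which two segments carry the same letter."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / longest

def fuzzy_similarity(query, entry, parts=3):
    """Average the per-segment similarities of a query term and an index entry."""
    pairs = zip(split_segments(query, parts), split_segments(entry, parts))
    return sum(segment_similarity(a, b) for a, b in pairs) / parts

score = fuzzy_similarity("presidant", "president")
print(split_segments("presidant"))  # ['pre', 'sid', 'ant']
print(round(score * 100))           # 89
```

A search engine built on this idea would treat an index entry as a hit when the averaged score exceeds a chosen threshold, for example 80%.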

Attribute-Based Search

In addition to basic Boolean text searches, users can also specify attributes, such as author or last change date of a document. These attributes must be maintained in the searched document repository.

Ranking of Documents

As detailed above, all documents in a search result are ranked according to the similarity of their content vectors to the search vector. This provides a good measure of their relevance to the searching user.

     Ranking clearly only delivers meaningful results when the request involves a search of the document's text content (or its resulting content vector), rather than an attribute search, which can have a "binary result" (for example, the attribute Author=Smith is either true or false). In that case, of course, ranking is superfluous.

Feature Extraction

Features of each document can be extracted and displayed with a search result (see the previous section, "Ranking of Documents"). Listing the features is often a much better way to communicate the actual content of a document than simply displaying its first few lines.

Example: A search for "president" may correspond to documents with features "Palm Beach," "election," and "Republicans," or with the features "Israel," "peace talks," and "Palestinian." Obviously these are two completely different types of content, but both documents could start with the same text: "Washington, D.C. - The president..."

Extend the Basic Search

If a user chooses the advanced search options, or after an initial simple search, there are several ways to extend or narrow down the results intelligently.

The "See Also" Option

After your original search, SAP's vector-based search engine provides the option to mark any number of documents in the search result and then search for other similar documents, as shown in Figure 4. This option allows you to find similar documents, even if your original search text is not contained in those documents.

Figure 4 Using the "See Also" Option


Example: If your search for "president" returns information about both the US elections and the Middle East peace talks, you can then mark documents related to the peace talks only, and begin another search for similar documents. The new result will contain documents that may not contain the word "president," but will reflect the new context of your search.
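One way to picture this kind of follow-up search is to add up the vectors of the marked documents and use the sum as a new query. The Python sketch below does exactly that, under the assumption that raw word counts and cosine similarity are used; the four sample documents and the function name see_also are invented for illustration and do not describe SAP's implementation.

```python
import math
from collections import Counter

def vec(text):
    """Sparse word-count vector of a text."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity of two sparse vectors."""
    dot = sum(w * b[t] for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def see_also(marked_ids, docs):
    """Rank the unmarked documents by similarity to the summed vector of the marked ones."""
    profile = Counter()
    for doc_id in marked_ids:
        profile.update(vec(docs[doc_id]))
    others = [
        (cosine(profile, vec(text)), doc_id)
        for doc_id, text in docs.items()
        if doc_id not in marked_ids
    ]
    return [doc_id for score, doc_id in sorted(others, reverse=True) if score > 0]

# Hypothetical documents, invented for illustration.
docs = {
    "d1": "president opens middle east peace talks",
    "d2": "president leads in election recount",
    "d3": "envoys resume peace talks in middle east",
    "d4": "election recount continues in florida",
}
similar = see_also(["d1"], docs)
print(similar)  # ['d3', 'd2']
```

Document "d3" ranks first even though it never mentions "president," because it shares the context of the marked document.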

Search for Similar or Related Terms

The search for similar or related terms is, in some ways, like the "See Also" and "Feature Extraction" functionality. For a given word or term, the search result will list all features that occur together with the search term you entered. Thus, your result in this case is a list of words and phrases instead of links to documents.

     This option, like the See Also and Feature Extraction options, allows you to conduct alternative searches if the original search term does not lead to the desired results.
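A rough way to approximate such a term list is to count which words share documents with the search term. The Python sketch below assumes single words as features and simple co-occurrence counting; the sample documents and the function name related_terms are invented, and a production system would extract weighted features rather than raw words.

```python
from collections import Counter

def related_terms(term, docs, top=3):
    """For each co-occurring word, count the documents in which it appears together with `term`."""
    term = term.lower()
    counts = Counter()
    for text in docs:
        words = set(text.lower().split())
        if term in words:
            counts.update(words - {term})
    return counts.most_common(top)

# Hypothetical documents, invented for illustration.
docs = [
    "president election florida",
    "president election recount",
    "president peace talks",
    "weather florida",
]
print(related_terms("president", docs, top=1))  # [('election', 2)]
```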

Find Subject Matter Experts

If the attribute "author" is maintained for all documents in a repository, it is possible to match up terms or words with a particular individual.

     You may, for instance, search the internal technical documentation of a car manufacturer for the term "electronic fuel injection." With a search for subject matter experts in this area, the system delivers the names of authors who have written the texts that most frequently contain the search term. The same principle may also be used to match search terms to any other maintained attribute.
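The expert search can be sketched as a count of term occurrences grouped by the "author" attribute. The Python example below is a minimal illustration; the author names, document texts, and the function name find_experts are all invented.

```python
from collections import Counter

def find_experts(phrase, documents, top=2):
    """Rank authors by how often the phrase occurs across the texts they wrote."""
    phrase = phrase.lower()
    counts = Counter()
    for doc in documents:
        occurrences = doc["text"].lower().count(phrase)
        if occurrences:
            counts[doc["author"]] += occurrences
    return counts.most_common(top)

# Hypothetical repository entries, invented for illustration.
documents = [
    {"author": "Meyer", "text": "Electronic fuel injection basics: tuning the electronic fuel injection unit"},
    {"author": "Schmidt", "text": "Brake maintenance guide"},
    {"author": "Meyer", "text": "Calibrating electronic fuel injection sensors"},
    {"author": "Tanaka", "text": "Overview of electronic fuel injection"},
]
experts = find_experts("electronic fuel injection", documents)
print(experts)  # [('Meyer', 3), ('Tanaka', 1)]
```

Grouping by any other maintained attribute, such as department or document type, would follow the same pattern.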

Methods for Document Classification

To provide a hierarchical navigation through a larger number of indexed and classified documents, content classes (manually or automatically defined groups of documents with similar content) are needed to structure the index. These classes can be generated in one of two ways: either by defining a number of documents as characteristic of each class that should be created, or by using clustering methods.

Clustering repeatedly compares document vectors (defined earlier in this article) and sorts them into clusters. Then it compares and groups the cluster vectors until the desired granularity of document clusters or classes has been reached.
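The repeated compare-and-group step can be illustrated with a small bottom-up clustering routine in Python. This is a sketch under stated assumptions: documents are already reduced to two-dimensional vectors, each cluster is represented by its average vector, and merging stops at a requested number of clusters. The vectors and the function names are invented for illustration.

```python
import math

def centroid(vectors):
    """Average vector of a group of equally long vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def cluster(vectors, target):
    """Merge the two closest clusters (by centroid distance) until `target` clusters remain."""
    clusters = [[v] for v in vectors]
    while len(clusters) > target:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = math.dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical document vectors over two words, e.g. ("election", "peace").
vectors = [(1.0, 0.0), (0.8, 0.2), (0.0, 1.0), (0.1, 0.9)]
groups = cluster(vectors, target=2)
print(groups)  # [[(1.0, 0.0), (0.8, 0.2)], [(0.0, 1.0), (0.1, 0.9)]]
```

Lowering or raising `target` corresponds to choosing a coarser or finer granularity of content classes.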

     After initial content classes have been defined, the system can automatically assign new documents to those classes when they are introduced into the classified document repository (see Figure 2).

Some approaches to assigning new documents to classes include these methods:

  • The Centroid Method, simply speaking, uses the average vectors of document groups to classify new documents.
  • The K Nearest Neighbors Method searches for the K documents in the repository that are most similar to the new document (where K is a constant specified by the author of a specific algorithm). The degrees of similarity to the new document are added for each class to which these "neighbors" belong. The new document is assigned to the class with the highest sum of similarity degrees.
  • The Linear Least Squares Fit Method builds a class-document matrix. This matrix, combined with the original document-word matrix, determines a factor matrix that can be multiplied with the vector of a newly checked-in document to determine its place in the class-document matrix.
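The first two methods in the list can be sketched in Python as follows. The example assumes small dense vectors, cosine similarity as the degree of similarity, and K = 3; the training vectors, class names, and function names are invented for illustration. The third method requires solving a matrix equation and is not shown here.

```python
import math
from collections import defaultdict

def cosine(a, b):
    """Cosine similarity of two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def classify_centroid(new_vec, labeled):
    """Assign the class whose average vector is most similar to the new document."""
    groups = defaultdict(list)
    for vector, label in labeled:
        groups[label].append(vector)
    centroids = {
        label: [sum(column) / len(vectors) for column in zip(*vectors)]
        for label, vectors in groups.items()
    }
    return max(centroids, key=lambda label: cosine(new_vec, centroids[label]))

def classify_knn(new_vec, labeled, k=3):
    """Sum the similarities of the k nearest documents per class; the highest sum wins."""
    nearest = sorted(labeled, key=lambda item: cosine(new_vec, item[0]), reverse=True)[:k]
    totals = defaultdict(float)
    for vector, label in nearest:
        totals[label] += cosine(new_vec, vector)
    return max(totals, key=totals.get)

# Hypothetical classified documents; dimensions are counts of ("election", "peace", "florida").
labeled = [
    ((3, 0, 1), "elections"),
    ((2, 0, 2), "elections"),
    ((0, 3, 0), "peace talks"),
    ((1, 2, 0), "peace talks"),
]
new_document = (2, 0, 1)
print(classify_centroid(new_document, labeled))  # elections
print(classify_knn(new_document, labeled))       # elections
```

In the nearest-neighbors run, two of the three closest documents belong to "elections" and contribute almost all of the summed similarity, so both methods agree on the class.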

Additional methods use different approaches, but all are based on vectors in the n × m word-document matrix. SAP is leveraging state-of-the-art algorithms from this pool of vector-based methods to achieve a highly intelligent classification and retrieval of documents.

Content Classification of Documents

Automatically categorizing documents by content classes is only possible using vector-based systems. Initial content classes can be defined manually or by using clustering methods. (See sidebar, "Methods for Document Classification," above.)

     These content classes can help users to navigate through sub-topics before they start an actual search. They can also serve as a basis for subscription.

Subscription to Content Classes

If users want to be notified about new documents, they can indicate this by content class, or by the same checkbox principle used in a See Also search. When a document is automatically classified as belonging to a subscribed class or is recognized as similar to a marked document, the user receives a notification in his or her inbox.

Retrieve External Web Content with Web Crawlers

Web Crawlers can be sent out to "crawl" document repositories on the Web. An administrator specifies a URL and a depth of links (e.g., "www.cnn.com" to a depth of three links) up to which all documents are located and retrieved. The classification engine then organizes the located documents by class.
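The depth limit can be illustrated with a breadth-first traversal in Python. To keep the sketch self-contained, it walks an in-memory link table instead of fetching pages over HTTP; the page names in LINKS and the function name crawl are invented for illustration.

```python
from collections import deque

# Hypothetical link graph standing in for real page fetches; every entry is invented.
LINKS = {
    "site/home": ["site/news", "site/about"],
    "site/news": ["site/news/story"],
    "site/about": [],
    "site/news/story": ["site/news/story/comments"],
    "site/news/story/comments": [],
}

def crawl(start, max_depth, links=LINKS):
    """Breadth-first traversal returning each page reachable within max_depth links of start,
    mapped to the number of links needed to reach it."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        depth = seen[url]
        if depth == max_depth:
            continue  # do not follow links beyond the configured depth
        for target in links.get(url, []):
            if target not in seen:
                seen[target] = depth + 1
                queue.append(target)
    return seen

pages = crawl("site/home", max_depth=2)
print(sorted(pages))  # ['site/about', 'site/home', 'site/news', 'site/news/story']
```

The comments page sits three links from the start page and is therefore left out at a depth of two. Each retrieved page would then be handed to the classification engine.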

Integrate Internal Content and Web Sources

Using the Web Crawler, the content of Web pages can be included in the internally maintained index, either with a copy of the document itself on the content server, or as metadata, index entry, and URL only.

Conclusion

SAP's search functionality is based on state-of-the-art algorithms and will significantly improve the structuring and retrieval of information. The features described in this article will enable users to:

  • Conduct exact searches for specific information
  • Retrieve related and additional information
  • Conduct fuzzy searches to retrieve information, even with inexact queries
  • Judge the relevance of information quickly
  • Receive options and hints for further searches
  • Relate information to experts and authors

     You can leverage SAP's search and classification functionality to save your users and knowledge workers a great deal of time, giving them the ability to actually work with the information they find, and - who knows? - perhaps even boosting productivity and creativity.


Karsten Hohage is with the Product Management Team of mySAP Business Intelligence & mySAP Workplace. He can be reached at karsten.hohage@sap.com.
¹ mySAP Workplace will include the classification and search functionality described here, as well as a suite of Web content management features. Readers may recognize some of these features from the SAP Knowledge Warehouse, but most are new developments to suit the content management needs of a portal solution. Web content management and search capabilities will be included in mySAP Workplace, and no separate KW license is required.
² As opposed to "structured content" or "transactional content." Unstructured content in this case refers to text documents in various formats (HTML, DOC, TXT, etc.) that may or may not have metadata to describe them, but whose text content itself is not maintained in a database field. The search functionality is therefore comparable to those known from search engines on the Internet. Note that "structured content" may also be interlinked with documents.
