In this Information Age, with its often overwhelming access to content,
making information retrieval manageable is a top business priority. With
mySAP Workplace, SAP provides personalization and role-based preconfiguration
of an easy-to-navigate structure as one way to make information more accessible.
With upcoming releases, the Workplace will
also include an intelligent search engine.¹ Users can search in-house
SAP and non-SAP systems, as well as the Web, for all sorts of unstructured
content.² Regardless of where the target information resides, users
can search for it by file name, terms, phrases, subject, or a person's
The highly sophisticated search engine
of mySAP Workplace will offer:
- Automated classification of documents from a variety of different
sources (internal SAP systems, non-SAP systems, and the Web)
- The ranking of each document it finds, plus the reason the document
has been retrieved
- Hints for accessing similar topics
- The option to be notified about new documents that fall into specified
This article examines both the technical
background of SAP's search and classification methodology for unstructured
content, and the functionality users can expect from these new developments.
Figure 1 highlights the relationship between SAP's search and retrieval
functionality and the architecture of mySAP Workplace.
Figure 1: SAP's Search and Retrieval Functionality in the mySAP
Two Approaches to Search Capabilities
There are two basic approaches to search engine capabilities:
- The traditional Boolean text approach
- The vector-based classification method, which is the basis for advanced
Boolean Search Engines with Inverted Index
The traditional approach to indexing and searching text supports Boolean
expressions (AND, OR, etc.) and uses a structure called the inverted index.
The inverted index creates a record of the location of each word in a
database and associates a list of documents with all terms found. The
term list is called the index. The list of documents associated with each
index entry is called the occurrence list (or posting list).
When a user enters the search term "president,"
for example, the inverted index delivers the occurrence list for the word
"president." The inverted index is usually maintained as an automatically
generated dictionary, with a list of "occurrence pointers" (which, of
course, point the search to occurrence lists) associated with each word.
The search can also include constraints
in the form of Boolean expressions connecting several search terms.
Example: If you are looking for information on US presidential
elections, but not election results in Florida, you would enter "US
AND president AND election NOT Florida." The search process looks
up these terms in the index and retrieves from the occurrence list
only the documents that satisfy these constraints.
The advantage of this traditional method
is that it allows you to issue exact queries as complex Boolean expressions.
The disadvantage is that this method doesn't directly support advanced
functions, such as ranking documents by significance or subsequent searches
for similar documents.
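The inverted index and Boolean lookup described above can be sketched in a few lines of Python. This is a minimal illustration, not SAP's implementation: the function names, the whitespace tokenization, and the sample documents are all assumptions made for the example.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercased word to the set of document ids that contain it.

    The keys form the index; each set is that word's occurrence list.
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def boolean_search(index, required, excluded=()):
    """Return ids of documents containing every required term and no excluded term."""
    if not required:
        return set()
    # AND: intersect the occurrence lists of all required terms.
    result = set.intersection(*(index.get(t.lower(), set()) for t in required))
    # NOT: remove every document that appears in an excluded term's list.
    for term in excluded:
        result -= index.get(term.lower(), set())
    return result

# Hypothetical sample documents.
docs = {
    1: "US president election results in Florida",
    2: "US president election campaign begins",
    3: "Company president announces new product",
}
index = build_inverted_index(docs)
hits = boolean_search(index, ["US", "president", "election"], excluded=["Florida"])
print(sorted(hits))  # [2]
```

Note that the result is an unordered set: every document either satisfies the expression or does not, which is why this approach offers no ranking on its own.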
The Vector-Based Approach: The Foundation for SAP's New Search Functionality
SAP's new search and classification functionality uses both the Boolean
and vector-based approaches. In the vector-based method, the following
basic concepts apply:
The documents in a repository (local file server, Web server, Lotus database,
etc.) and the words occurring in these documents form an n-by-m matrix
(n documents by m words). In the simplest case, documents are indexed according to the x
most significant words that occur within them when they are checked in.
This results in a document vector for each document in the n-by-m
matrix, where the matrix is defined by all involved words and documents.
The features of the resulting vectors are
words or combinations of words in a document (content or document vector)
or query (search or query vector). The vectors are weighted to give emphasis
to features that exemplify meaning and are useful in retrieval. When a
query is issued by a user, the query vector is compared to each document
vector. Those that are closest to the query are considered to be similar,
and are returned in the search result.
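As a rough sketch of this comparison, the following Python example builds unweighted word-count vectors and ranks documents by cosine similarity to the query. The article does not name the similarity measure SAP uses, so cosine similarity is an assumption here, as are the function names and the sample texts.

```python
import math
from collections import Counter

def to_vector(text):
    """Represent a text as a sparse vector of word counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine of the angle between two sparse vectors (1.0 means same direction)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank(query, docs):
    """Return (score, doc_id) pairs, best match first."""
    q = to_vector(query)
    scored = [(cosine(q, to_vector(text)), doc_id) for doc_id, text in docs.items()]
    return sorted(scored, reverse=True)

# Hypothetical sample documents.
docs = {
    "a": "president election florida",
    "b": "president peace talks israel",
    "c": "fuel injection manual",
}
ranked = rank("president election", docs)
for score, doc_id in ranked:
    print(doc_id, round(score, 3))
```

Document "a" shares both query words and ranks first, "b" shares one, and "c" shares none and scores zero.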
The vector-based approach allows the search
tool to assign a true rank to the documents it retrieves. The rank indicates
the level of match to your search query. Ranking the results in this approach
also assists in the automatic identification of document content classes
(manually or automatically defined groups of documents with similar content),
as highlighted in Figure 2.
Calculating the Significance of Search Terms
One formula for computing the significance of a particular search
term is TF x IDF. Simply put, this formula multiplies the term frequency,
or "TF," (i.e., the number of times a word appears in the document)
by the inverse document frequency, or "IDF," of the word. (In its simplest
form, IDF is equal to 1 divided by the number of documents in the entire
collection that contain the word; in practice, a logarithmically scaled
variant is common.)
The IDF factor is helpful in determining the content-discriminating
power of a particular search term.
For example, a term that appears
rarely (e.g., "pancreas") has a high IDF, while a term that occurs
in many documents (e.g., "is") has a low IDF. Using this formula,
the search engine will identify "pancreas" as a much better indicator
of a match than the word "is."
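A small worked sketch of this weighting follows. It assumes the widely used logarithmic form of IDF, log(N / df), where N is the number of documents and df is the number of documents containing the term; the function name and the three-document corpus are invented for illustration.

```python
import math

def tf_idf(term, doc_tokens, corpus):
    """Weight of `term` in one document: term frequency times log-scaled IDF."""
    tf = doc_tokens.count(term)                          # occurrences in this document
    df = sum(1 for tokens in corpus if term in tokens)   # documents containing the term
    if df == 0:
        return 0.0
    return tf * math.log(len(corpus) / df)

# Hypothetical corpus, already split into words.
corpus = [
    "the pancreas is an organ".split(),
    "the liver is an organ".split(),
    "the heart is a muscle".split(),
]
doc = corpus[0]
print(round(tf_idf("pancreas", doc, corpus), 3))  # 1.099
print(tf_idf("is", doc, corpus))                  # 0.0
```

Because "is" occurs in every document, its IDF is log(1) = 0 and it contributes nothing, whereas "pancreas" occurs in only one of three documents and receives a clearly positive weight.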
Figure 2: Automatic Classification Categorizes New Documents by
Content and Assigns Them to Content Classes
Most true ranking algorithms for Web resources
use similarity measures based on vector models. To compute
the similarities, one can think of each document as a vector with one
component per word. Words in these vectors are weighted by significance. (See sidebar,
"Calculating the Significance of Search Terms.")
Figure 3: Search and Classification Functionality Available in
SAP Knowledge Warehouse and in mySAP Workplace
SAP's Search and Classification Based on Vector Models
SAP's methods for search and classification can be used in the future
by any mySAP solution or non-SAP application via XML APIs. Indexing and
searching are not limited to SAP repositories, but can be extended to the
Web or third-party databases and file servers.
mySAP Workplace and its Web content management
components will be among the first to offer the extensive search and classification
functionality listed in Figure 3.
Start with the Basic Search
A search can start from various types of user interfaces, including mySAP
Workplace, any connected mySAP application, or even an independent interface
designed by customers and their consultants.
Some of the search features described in
the following sections are already part of SAP Knowledge Warehouse 5.1,
while the others will be added with mySAP Workplace soon. (Figure 3 specifies
which solutions offer the search and classification functionality discussed
in this article.)
A search starts in the way familiar to
most users - by typing in a search term, and perhaps specifying Boolean
operators such as AND, OR, etc. Then, using vector-based methods, the
system displays the result as a ranked list of documents, and includes
a summary of the differentiating features of each document.
"Exact" or "Fuzzy" Search
The user or the administrator can set searches to be conducted as "exact"
or "fuzzy" with a single click.
In an exact search, the user's term leads
to a search only for an exact match. In a fuzzy search, the search engine
gives the user some flexibility. It takes segments of the search terms
of a user's query and compares them to similarly segmented index entries.
If there is a high total similarity between the segments, the search engine
assumes a hit.
This is especially helpful when there is
no exact match for the user's search term.
Example: A user types in a search for "presidant" (misspelling
the word "president"). In a fuzzy search, assuming the search engine
will portion the word into thirds (for each of the three syllables),
the search of the first and second segments will return a 100% match.
The third portion of the search term returns a 67% match. The averaged
similarity is 89%, and the search for "presidant" returns documents
that include the word "president" - documents that would not have
been retrieved in an exact search.
Of course, this is a highly simplified
example. The actual algorithm for the fuzzy search works with a similar
but more complex method of segmented comparison and similarity calculation.
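The simplified example can be reproduced with the following Python sketch. It is only an illustration of segmented comparison: it cuts words into equal-length thirds rather than syllables and compares characters position by position, both of which are assumptions of this sketch and not the algorithm SAP uses.

```python
def split_segments(word, parts=3):
    """Split a word into `parts` consecutive segments of nearly equal length."""
    size, extra = divmod(len(word), parts)
    segments, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        segments.append(word[start:end])
        start = end
    return segments

def segment_similarity(a, b):
    """Share of positions at which two segments carry the same character."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return sum(x == y for x, y in zip(a, b)) / longest

def fuzzy_similarity(query, entry, parts=3):
    """Average the per-segment similarities of a query and an index entry."""
    pairs = zip(split_segments(query, parts), split_segments(entry, parts))
    return sum(segment_similarity(a, b) for a, b in pairs) / parts

print(split_segments("presidant"))                              # ['pre', 'sid', 'ant']
print(round(fuzzy_similarity("presidant", "president"), 2))     # 0.89
```

A search engine using such a measure would treat an index entry as a hit once the averaged similarity exceeds some chosen threshold.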
In addition to basic Boolean text searches, users can also specify attributes,
such as author or last change date of a document. These attributes must
be maintained in the searched document repository.
Ranking of Documents
As detailed above, all documents in a search result are ranked according
to the similarity of their content vectors to the search vector. This
provides a good measure of their relevance to the searching user.
Ranking clearly only delivers meaningful
results when the request involves a search of the document's text content
(or its resulting content vector), rather than an attribute search, which
can have a "binary result" (for example, the attribute Author=Smith is
either true or false). In that case, of course, ranking is superfluous.
Feature Extraction
Features of each document can be extracted and displayed with a search
result (see the previous section, "Ranking of Documents"). Listing the
features is often a much better way to communicate the actual content
of a document than simply displaying its first few lines.
Example: A search for "president" may correspond to documents
with features "Palm Beach," "election," and "Republicans," or with
the features "Israel," "peace talks," and "Palestinian." Obviously
these are two completely different types of content, but both documents
could start with the same text: "Washington, D.C. - The president..."
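One plausible way to pick such features is to list the words with the highest TF x IDF weight in each document. The sketch below assumes exactly that; the article does not say how SAP selects features, and the function name and sample texts are hypothetical.

```python
import math
from collections import Counter

def extract_features(doc_id, docs, top_n=3):
    """Return the top_n words of one document, ranked by TF x log-scaled IDF."""
    tokenized = {d: text.lower().split() for d, text in docs.items()}
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word occur?
    df = Counter(w for tokens in tokenized.values() for w in set(tokens))
    tf = Counter(tokenized[doc_id])
    weights = {w: count * math.log(n_docs / df[w]) for w, count in tf.items()}
    # Highest weight first; ties are broken alphabetically for stable output.
    ranked = sorted(weights, key=lambda w: (-weights[w], w))
    return ranked[:top_n]

# Two hypothetical documents that begin with the same words.
docs = {
    "d1": "the president spoke about the election and the election results",
    "d2": "the president spoke about the peace talks and the peace process",
}
print(extract_features("d1", docs, top_n=2))  # ['election', 'results']
print(extract_features("d2", docs, top_n=2))  # ['peace', 'process']
```

Although both texts open identically, the extracted features separate them at a glance.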
Extend the Basic Search
If a user chooses the advanced search options, or after an initial simple
search, there are several ways to extend or narrow down the results intelligently.
The "See Also" Option
After your original search, SAP's vector-based search engine provides
the option to mark any number of documents in the search result and then
search for other similar documents, as shown in Figure 4. This
option allows you to find similar documents, even if your original search
text is not contained in those documents.
Figure 4: Using the "See Also" Option
Example: If your search for "president" returns information
about both the US elections and the Middle East peace talks, you can
then mark documents related to the peace talks only, and begin another
search for similar documents. The new result will contain documents
that may not contain the word "president," but will reflect the new
context of your search.
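A "See Also" step of this kind can be sketched by adding up the vectors of the marked documents and using the sum as a new query vector. The cosine measure, the function names, and the sample documents below are assumptions for illustration only.

```python
import math
from collections import Counter

def vectorize(text):
    """Represent a text as a sparse vector of word counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity of two sparse vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def see_also(marked_ids, docs):
    """Rank the unmarked documents by similarity to the marked ones."""
    vectors = {d: vectorize(text) for d, text in docs.items()}
    query = Counter()
    for d in marked_ids:
        query.update(vectors[d])          # sum of the marked document vectors
    others = [d for d in docs if d not in marked_ids]
    scored = [(cosine(query, vectors[d]), d) for d in others]
    return [d for score, d in sorted(scored, reverse=True) if score > 0]

# Hypothetical search result; the user marks document 1.
docs = {
    1: "president opens middle east peace talks",
    2: "president wins election in florida",
    3: "negotiators resume peace talks in jerusalem",
    4: "election recount continues in florida",
}
similar = see_also([1], docs)
print(similar)  # [3, 2]
```

Document 3 ranks first even though it never mentions "president," because it shares the peace-talks context of the marked document.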
Search for Similar or Related Terms
The search for similar or related terms is, in some ways, like the "See
Also" and "Feature Extraction" functionality. For a given word or term,
the search result will list all features that occur together with the
search term you entered. Thus, your result in this case is a list of words
and phrases instead of links to documents.
This option, like the See Also and Feature
Extraction options, allows you to conduct alternative searches if the
original search term does not lead to the desired results.
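A very simple way to produce such a list is to count which words share a document with the search term. The sketch below assumes plain co-occurrence counting; the function name and the sample texts are hypothetical, and a production system would likely weight the terms rather than merely count them.

```python
from collections import Counter

def related_terms(term, docs, top_n=3, stopwords=frozenset()):
    """List the words that most often occur in the same document as `term`."""
    term = term.lower()
    counts = Counter()
    for text in docs:
        words = set(text.lower().split())
        if term in words:
            counts.update(words - {term} - stopwords)
    # Most frequent companions first; ties are broken alphabetically.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ordered[:top_n]]

# Hypothetical document texts.
docs = [
    "president election florida",
    "president election recount",
    "president peace talks",
    "fuel injection",
]
print(related_terms("president", docs, top_n=2))  # ['election', 'florida']
```

The result is a list of terms, not documents, which the user can then feed into a new search.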
Find Subject Matter Experts
If the attribute "author" is maintained for all documents in a repository,
it is possible to match up terms or words with a particular individual.
You may, for instance, search the internal
technical documentation of a car manufacturer for the term "electronic
fuel injection." With a search for subject matter experts in this area,
the system delivers the names of authors who have written the texts that
most frequently contain the search term. The same principle may also be
used to match search terms to any other maintained attribute.
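Assuming each document carries an "author" attribute, the matching can be sketched as a count of term occurrences per author. The record layout, the function name, and the sample data are invented for this example.

```python
from collections import Counter

def find_experts(term, documents, top_n=2):
    """Return the authors whose texts contain the search term most often."""
    needle = term.lower()
    counts = Counter()
    for doc in documents:
        occurrences = doc["text"].lower().count(needle)
        if occurrences:
            counts[doc["author"]] += occurrences
    return [author for author, _ in counts.most_common(top_n)]

# Hypothetical documents with a maintained "author" attribute.
documents = [
    {"author": "Smith", "text": "Electronic fuel injection basics. Electronic fuel injection timing."},
    {"author": "Smith", "text": "Tuning electronic fuel injection"},
    {"author": "Jones", "text": "Electronic fuel injection sensors"},
    {"author": "Lee", "text": "Brake systems"},
]
print(find_experts("electronic fuel injection", documents))  # ['Smith', 'Jones']
```

Replacing the "author" key with any other maintained attribute gives the more general matching mentioned above.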
Methods for Document Classification
To provide a hierarchical navigation through a larger number of
indexed and classified documents, content classes (manually or automatically
defined groups of documents with similar content) are needed to
structure the index. These classes can be generated in one of two
ways: either by defining a number of documents as characteristic
of each class that should be created, or by using clustering methods.
Clustering repeatedly compares document vectors (defined earlier in
this article) and sorts them into clusters. Then it compares and groups
the cluster vectors until the desired granularity of document clusters
or classes has been reached.
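The repeated compare-and-merge idea can be illustrated with a simple bottom-up clustering sketch. It merges the two most similar clusters until a target number remains, using cosine similarity and summed vectors as cluster vectors; all of these choices, as well as the names and sample data, are assumptions and not a description of SAP's clustering.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity of two sparse vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def cluster(vectors, target):
    """Merge the two most similar clusters until `target` clusters remain."""
    clusters = [([name], Counter(vec)) for name, vec in vectors.items()]
    while len(clusters) > max(target, 1):
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                score = cosine(clusters[i][1], clusters[j][1])
                if best is None or score > best[0]:
                    best = (score, i, j)
        _, i, j = best
        members = clusters[i][0] + clusters[j][0]
        summed = clusters[i][1] + clusters[j][1]   # cluster vector of the merged group
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append((members, summed))
    return sorted(sorted(members) for members, _ in clusters)

# Hypothetical documents as word-count vectors.
texts = {
    "a": "election vote florida",
    "b": "election vote recount",
    "c": "peace talks israel",
    "d": "peace talks summit",
}
vectors = {name: Counter(text.split()) for name, text in texts.items()}
groups = cluster(vectors, target=2)
print(groups)  # [['a', 'b'], ['c', 'd']]
```

The `target` parameter plays the role of the desired granularity: a smaller value yields fewer, broader classes.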
After initial content classes have
been defined, the system can automatically assign new documents
to those classes when they are introduced into the classified document
repository (see Figure 2).
Some approaches to assigning new documents to classes include these methods:
- The Centroid Method, simply speaking, uses the average
vectors of document groups to classify new documents.
- The K Nearest Neighbors Method searches for the K most similar
single documents in the repository (where K is a constant specified
by the author of a specific algorithm). The degrees of similarity
to the new document are added for each class to which these "neighbors"
belong. The new document will belong to the class with the highest
sum of similarity degrees.
- The Linear Least Squares Fit Method builds a class-document
matrix. This matrix, combined with the original document-word
matrix, determines a factor matrix that can be multiplied with
the vector of newly checked-in documents to determine their place
in the class-document matrix.
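The first two methods in this list can be sketched as follows. The sketch assumes cosine similarity on word-count vectors and a small hand-labeled training set; the function names, class labels, and texts are hypothetical.

```python
import math
from collections import Counter, defaultdict

def cosine(a, b):
    """Cosine similarity of two sparse vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def classify_centroid(new_vec, labeled):
    """Assign the class whose summed document vector is most similar.

    The sum points in the same direction as the average vector,
    so the cosine comparison gives the same answer for both.
    """
    centroids = defaultdict(Counter)
    for vec, label in labeled:
        centroids[label].update(vec)
    return max(centroids, key=lambda label: cosine(new_vec, centroids[label]))

def classify_knn(new_vec, labeled, k=3):
    """Add up the similarities of the k nearest documents per class."""
    nearest = sorted(labeled, key=lambda item: cosine(new_vec, item[0]), reverse=True)[:k]
    totals = defaultdict(float)
    for vec, label in nearest:
        totals[label] += cosine(new_vec, vec)
    return max(totals, key=totals.get)

# Hypothetical training documents with manually assigned classes.
training = [
    (Counter("election vote florida".split()), "politics"),
    (Counter("election recount vote".split()), "politics"),
    (Counter("peace talks israel".split()), "diplomacy"),
    (Counter("peace talks summit".split()), "diplomacy"),
]
new_doc = Counter("vote recount florida".split())
print(classify_centroid(new_doc, training))  # politics
print(classify_knn(new_doc, training, k=3))  # politics
```

Both methods agree on this tiny example; on real repositories they can differ, since the centroid method looks at a class as a whole while the nearest-neighbor method looks only at the closest individual documents.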
Additional methods use different approaches, but all are based
on vectors in the n-by-m word-document matrix. SAP is leveraging
state-of-the-art algorithms from this pool of vector-based methods
to achieve a highly intelligent classification and retrieval of
Content Classification of Documents
Automatically categorizing documents by content classes is only possible
using vector-based systems. Initial content classes can be defined manually
or by using clustering methods. (See sidebar, "Methods for Document Classification.")
These content classes can help users to
navigate through sub-topics before they start an actual search. They can
also serve as a basis for subscription.
Subscription to Content Classes
If users want to be notified about new documents, they can indicate this
by content class, or by the same checkbox principle used in a See Also
search. When a document is automatically classified as belonging to a
subscribed class or is recognized as similar to a marked document, the
user receives a notification in his or her inbox.
Retrieve External Web Content with Web Crawlers
Web Crawlers can be sent out to "crawl" document repositories on the
Web. An administrator specifies a URL and a depth of links (e.g., "www.cnn.com"
to a depth of three links) up to which all documents are located and retrieved.
The classification engine then organizes the located documents by class.
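The depth-limited retrieval can be sketched as a breadth-first traversal. To keep the example self-contained and runnable offline, the links are read from a small in-memory dictionary through a hypothetical `get_links` callback; a real crawler would fetch each page and parse its links instead.

```python
from collections import deque

def crawl(start_url, max_depth, get_links):
    """Return every URL reachable from start_url within max_depth links."""
    seen = {start_url}
    queue = deque([(start_url, 0)])
    order = []
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth == max_depth:
            continue                      # do not follow links beyond the limit
        for link in get_links(url):
            if link not in seen:          # visit each page only once
                seen.add(link)
                queue.append((link, depth + 1))
    return order

# Hypothetical link structure standing in for a real Web site.
site = {
    "example.com": ["example.com/a", "example.com/b"],
    "example.com/a": ["example.com/a/1"],
    "example.com/a/1": ["example.com/a/1/x"],
    "example.com/b": ["example.com"],
}
pages = crawl("example.com", 2, lambda url: site.get(url, []))
print(pages)
```

With a depth of two, the page three links away from the start is not retrieved, and the link back to the start page is ignored because that page has already been seen.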
Integrate Internal Content and Web Sources
Using the Web Crawler, the content of Web pages can be included in the
internally maintained index, either with a copy of the document itself
on the content server, or as metadata, index entry, and URL only.
SAP's search functionality is based on state-of-the-art algorithms and
will significantly improve the structuring and retrieval of information.
The features described in this article will enable users to:
- Conduct exact searches for specific information
- Retrieve related and additional information
- Conduct fuzzy searches to retrieve information, even with inexact
- Judge the relevance of information quickly
- Receive options and hints for further searches
- Relate information to experts and authors
You can leverage SAP's search and classification
functionality to save your users and knowledge workers great amounts
of time, giving them the ability to actually work with the information
they find, and - who knows? - perhaps even boosting productivity and creativity.
Karsten Hohage is with the Product Management Team of mySAP Business
Intelligence & mySAP Workplace. He can be reached at email@example.com.
¹ mySAP Workplace will include the classification and search
functionality described here, as well as a suite of Web content management
features. Readers may recognize some of these features from the SAP
Knowledge Warehouse, but most are new developments to suit the content
management needs of a portal solution. Web content management and
search capabilities will be included in mySAP Workplace, and no separate
KW license is required.
² As opposed to "structured content" or "transactional content."
Unstructured content in this case refers to text documents in various
formats (HTML, DOC, TXT, etc.) that may or may not have metadata to
describe them, but whose text content itself is not maintained in
a database field. The search functionality is therefore comparable
to those known from search engines on the Internet. Note that "structured
content" may also be interlinked with documents.