
Searching for Experts with Expertise-Locator Knowledge Management Systems

Irma Becerra-Fernandez, Ph.D. Florida International University Miami, FL 33199

[email protected]

Abstract

This paper discusses expertise-locator knowledge management systems, and specifically the implementation details of two such systems: the Searchable Answer Generating Environment (SAGE) and Expert Seeker. Funded by NASA, SAGE serves to search for researchers in universities in Florida. Expert Seeker will serve to search for experts in one of the best-known knowledge organizations: the National Aeronautics and Space Administration. Implementation details, results to date, and future plans are also presented.

1. Introduction

Knowledge management systems (KMS) have been defined as "an emerging line of systems [which] target professional and managerial activities by focusing on creating, gathering, organizing, and disseminating an organization's 'knowledge' as opposed to 'information' or 'data'" (Alavi and Leidner, 1999). Based on the KM Life Cycle models (Nissen, 2000) and on a study of the KMS underway at many organizations (Becerra-Fernandez, 1998a), a framework emerges for the classification of KMS (Becerra-Fernandez and Stevenson, 2000). The framework includes the following:

1. Knowledge Preservation: Refers to systems that preserve and formalize the knowledge of experts so it can be shared with others. As such, these systems aim to elicit and catalog the tacit knowledge of experts, and serve to transfer their knowledge. Knowledge preservation systems formalize knowledge in models such as concept maps, which allow others to learn the domain (Cañas et al., 1999).

2. Knowledge Application: Refers to systems that assist in solving problems. Organizations with significant intellectual capital need to elicit and capture knowledge for reuse in solving new problems as well as recurring old problems. New problems could be similar to old problems or even consist of a combination of old problems (Becerra-Fernandez and Aha, 1999a).

3. Knowledge Discovery: Refers to systems that create new knowledge through the implementation of intelligent algorithms such as data mining, and through the inference of data relationships (Fayyad et al., 1996).

4. Knowledge Repository: Refers to systems that organize and distribute knowledge. Knowledge repositories comprise the majority of the KMS currently in place. Expertise-locator systems (also called knowledge yellow pages or people-finder systems) are a special type of knowledge repository that points to experts, those who have the knowledge within the organization (Becerra-Fernandez, 2000a, 2000b, 2000c).

2. A Survey of Expertise-Locator Systems

Several organizations in different business categories have identified the need to develop systems to help locate intellectual capital, or expertise-locator KMS. The intent when developing these systems is to catalog knowledge competencies, including information not typically captured by Human Resources systems, in a way that could later be queried across the organization. Prior to embarking on the development of such systems, a literature review of hallmark expertise-locator KMS was conducted, and follow-up personal interviews with the developers of such systems were held. Details about the implementation of these expertise-locator KMS are discussed in Becerra-Fernandez (2000a). Previous research (Davenport, 1996) conducted to establish the design parameters of the Expert Seeker application has demonstrated that one of the challenges in developing expertise-locator KMS is the inherent shortcoming of self-assessment (Becerra-Fernandez, 2000a). Another challenge in developing expertise-locator KMS is the development of knowledge taxonomies, which could be critical to the success of an expertise-locator system (Davenport, 1996). The use of web data mining can mitigate some of the problems inherent in relying on the biased self-reporting required to keep employee profiles up to date, or in the need to develop an accurate organizational skill taxonomy a priori. Data mining refers to the extraction of information or the identification of patterns, usually within a large collection of data (Fayyad et al., 1996; Ahonen et al., 1997). Web data mining makes use of data mining techniques to extract information from web-related data. This paper describes the development of two such systems: the Searchable Answer Generating Environment (SAGE), an expertise-locator system within universities in Florida, and Expert Seeker, an expertise-locator system to be rolled out at the National Aeronautics and Space Administration (NASA).

3. The Searchable Answer Generating Environment (SAGE) Expert-Finder Knowledge Management System

The NASA/Florida Minority Institution Entrepreneurial Partnership (FMIEP) grant is funding the development of the Searchable Answer Generating Environment (SAGE), which is in the category of expertise-locator KMS (Becerra-Fernandez, 1998a). The purpose of this KMS is to create a repository of experts in Florida (FL) universities. Currently, each university in Florida keeps a database of funded research, but these databases are disparate and dissimilar. The SAGE Expert-Finder creates a single repository by incorporating a distributed database scheme, which can be searched by a variety of fields, including research topic, investigator name, funding agency, or university.

The SAGE system presents a unified database by masking multiple databases as if they were one. The main interfaces developed on the query engine use text fields to search the processed data for keywords, fields of expertise, names, or other applicable search fields. SAGE also includes a thesaurus, and a collection of concepts [1] that form an ontology [2], that upon request can perform a search on words that are similar to the keyword in use. A consistently applied taxonomy is used to improve upon the usual keyword and full-text based techniques. It allows an end user to retrieve information using appropriate terminology and avoids problems of poor selectivity and quality of results caused by missing, inconsistent, or conflicting vocabulary.

[1] Concepts are the entities and relationships of a semantic network.

[2] An ontology is an explicit formal specification of how to represent the objects, concepts, and other entities that are assumed to exist in some area of interest, and the relationships that hold among them.

4. The Technologies to Implement SAGE

The development of SAGE was marked by two design requirements: the need to validate the data used to identify the experts, and at the same time to minimize the impact on each of the universities' offices of sponsored research, which collect most of the required data. SAGE is built upon a search criterion that is recognized as a valid indicator of expertise: funded research grants received. The development of the SAGE database involved an initial design followed by incremental implementation phases. Cold Fusion™ was chosen as the middleware development environment because of its significant application strength and its demonstrated database interaction capabilities.

The use of a thesaurus extends the capability of the website by generating new keywords from an existing input provided by the user. The thesaurus provides a standardized means of organizing many kinds of information, both conceptual and taxonomic. In other words, the thesaurus is a tool designed to aid users in finding their way around a vocabulary database. In addition to its traditional use as an authority for the terms used in indexing the database, it offers reminders of terms the user might not even have considered. The thesaurus is constructed using the Perl programming language because of its powerful text processing capabilities. The script uses an existing pool of information from the Wordsmith Educational Dictionary Thesaurus (WEDT), which includes over 50,000 headwords and very precisely defined and hyperlinked synonyms. It retrieves an extended set of related terms or synonyms. Cold Fusion 4.5 is used to cache the results of the query. In addition, the new output is used in conjunction with the Verity search engine, which utilizes a stoplist [3] and uses stemming.

To achieve results we use inter-process communication (IPC) [4]. When a user submits a search, the script issues an HTTP request to a remote server by communicating through a socket (connection). The HTTP request queries an external search engine, which resides on the remote server. The script retrieves the HTML document generated by the search engine, parses the document using regular expressions, and retrieves the desired information. Because it issues the request, the script acts as a web client. Switching to a different set of libraries reduced the response time of the search. At first the library IO::Socket was used, which was not efficient: the response time was around 15-20 seconds. Now we are using LWP::UserAgent and HTTP::Request, which result in a response time of less than 5 seconds. The script relies on the structure of the document, through web structure mining. In this case the document is a raw HTML document. In order to retrieve data from an HTML document efficiently, the document must have a uniform format or a "cue" from which to start looking for the necessary data. Because the documents in this case met that requirement, the programming task was trivial.
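The lookup flow described above (act as a web client, retrieve an HTML page, locate a "cue", and extract related terms with regular expressions) can be illustrated with a minimal Python sketch. It is not the SAGE Perl script. The markup cue (`<div id="synonyms">`), the helper names, and the `thesaurus.example` address are all hypothetical, chosen only for illustration.

```python
import re
import urllib.parse
import urllib.request

# Hypothetical cue: synonyms are assumed to be links inside a block that
# starts with <div id="synonyms"> and ends at the first closing </div>.
SYNONYM_BLOCK = re.compile(r'<div id="synonyms">(.*?)</div>', re.IGNORECASE | re.DOTALL)
SYNONYM_LINK = re.compile(r"<a\b[^>]*>([^<]+)</a>", re.IGNORECASE)


def fetch_html(url, timeout=10.0):
    """Act as a web client: issue an HTTP GET and return the page as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def parse_synonyms(html):
    """Find the cue block, then pull the link texts out of it."""
    block = SYNONYM_BLOCK.search(html)
    if block is None:
        return []
    return [text.strip().lower() for text in SYNONYM_LINK.findall(block.group(1))]


def expand_keyword(keyword, lookup):
    """Return the keyword plus its synonyms; `lookup` maps a keyword to raw HTML."""
    terms = [keyword.lower()]
    for synonym in parse_synonyms(lookup(keyword)):
        if synonym not in terms:
            terms.append(synonym)
    return terms


def remote_lookup(keyword, base_url="http://thesaurus.example/search?q="):
    """Hypothetical remote lookup; the address is a placeholder, not a real service."""
    return fetch_html(base_url + urllib.parse.quote(keyword))
```

Passing the lookup as a function keeps the parsing logic testable without a network connection: `expand_keyword("robot", remote_lookup)` would query the placeholder server, while a canned HTML string can stand in for it during testing.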


The evaluation phase of SAGE included heavy end-user testing and processing time optimization. SAGE has been online since August 16, 1999 at http://sage.fiu.edu. SAGE receives approximately twenty hits per day, of which about four are new visitors. The majority of these hits come from institutes of higher education, as well as from military and commercial sites. Some of the commercial sites are companies developing advanced search engines or companies involved in the development of KMS. Our visitors come from the US and from around the world, including Japan, France, Austria, Switzerland, the Bahamas, Mexico, and the UK. Future developments for SAGE include algorithms that will make the maintenance of SAGE more automatic: a daemon will obtain the data from the universities' databases and transfer the information to the SAGE server, making the process independent of human intervention.

5. Overview of Expert Seeker: Locating Experts at the National Aeronautics and Space Administration

This section presents insights and lessons learned from the development of Expert Seeker (Becerra-Fernandez, 2000b), an organizational expertise-locator KMS that will be used to locate experts at the National Aeronautics and Space Administration (NASA). The NASA Faculty Awards for Research (FAR) for NASA-Kennedy Space Center (KSC), as well as the Center of Excellence in Space Data and Information Sciences (CESDIS) for Goddard Space Flight Center (GSFC), are funding the development of Expert Seeker at the FIU KM Lab. Expert Seeker aims to help locate specialists within NASA-KSC and GSFC, and its use is expected to expand to other NASA Centers. The Expert Seeker KMS is accessed via NASA's Intranet. It provides a unified interface to access competencies available within the organization, such as completed academic and non-academic courses, past projects, and other relevant knowledge. This expertise-locator KMS will be especially useful when organizing cross-functional teams. The main interfaces on the query engine in Expert Seeker use text fields to search the processed data for keywords, fields of expertise, names, or other applicable search fields. Expert Seeker will offer NASA experts more visibility, and at the same time allow interested parties to identify available expertise within NASA. Expert Seeker aims to help locate intellectual capital within NASA, and it is this particular characteristic that differentiates Expert Seeker from SAGE (the latter a KMS to find experts within Florida universities). Expert Seeker includes an interface to SAGE as well. In contrast with SAGE, which is on the World Wide Web, Expert Seeker is accessible only through NASA's Intranet. Another important difference between SAGE and Expert Seeker is that the latter will enable the user to search for much more detailed information regarding the experts' achievements, including information such as intellectual property, skills and competencies, as well as the proficiency level for each of the skills and competencies.

[3] A stoplist is a group of words that are not considered to have any indexing value. These include common words such as "and", "the", and "there".

[4] Inter-process communication (IPC) is when a local process (client) communicates with a remote process (server) across a network, resulting in an exchange of data.

6. The Technologies to Implement Expert Seeker

The development of Expert Seeker is being accomplished with the use of the following technologies:

1. Coding and programming using Cold Fusion 4.0, JavaScript, and Active Server Pages (ASP).
2. Database implementation with Microsoft SQL Server 7.0.
3. Search capabilities provided by Verity.
4. GUI design with Adobe Photoshop 5.0.
5. HTML and other web development tools.

The development of Expert Seeker requires the utilization of existing structured data as well as semi-structured and unstructured web-based information as much as possible. Expert Seeker uses the data in existing Human Resources databases for information such as employees' formal educational background, the X.500 directory for point-of-contact information, a skills database that profiles each employee's competency areas, and the Goal Performance Evaluation System (GPES). Information regarding skills and competencies, as well as proficiency levels for the skills and competencies, needs to be collected, to a large extent, through self-assessment. Furthermore, other related information deemed important in the generation of an expert profile, and not currently stored in an in-house database system, can be user-supplied, such as the employee's picture, project participation data, hobbies, and volunteer or civic activities.

7. The Web Text Mining Process

There are three types of web data mining: web structure mining, web usage mining, and web content mining. Web structure mining examines how the web documents themselves are structured. Web usage mining involves the identification of patterns in user navigation through web pages in a domain. Web content mining is used to discover what a web page is about and to uncover new knowledge from it. Web content mining is based on text mining and information retrieval (IR) techniques, which consist of organizing large amounts of textual data for the most efficient retrieval, an important consideration in handling text documents. IR techniques have become increasingly important as the amount of semi-structured as well as unstructured textual data present in organizations has increased dramatically. IR techniques provide a method to efficiently access these large amounts of information.

One application of these methods is in the construction of expertise-locator KMS. A KMS that locates experts based on published documents requires an automatic method for identifying employee names, as well as a method to associate employee names with skill keywords embedded in those documents. For this purpose, Expert Seeker required the development of a name-finding algorithm to identify names of NASA employees. Traditional IR techniques [5] were then used to identify and match skill keywords with the identified employee names. An IR system typically uses as input a set of inverted files, each of which is a sequence of words that reference the group of documents in which the words appear. These words are chosen according to a selection algorithm that determines which words in the document are good index terms. In a traditional IR system, the user enters a query, and the system retrieves all documents that match that keyword entry. Expert Seeker is based on an IR technique that goes one step further. When a user enters a query, the system initially performs a document search based on user input. However, since the user is looking for experts in a specific subject area, the system returns the names of those employees whose names appear in the matching documents (excluding webmasters and curators). The employee name results are ranked according to the number of matching documents in which each individual name appears. The employee information is then displayed to the user.

[5] See for example Selection by Discriminant Value in Frakes and Baeza-Yates (1992), an algorithm for selecting index terms.

The indexing process was carried out in four stages. First, all the relevant data was transferred to a local directory for further processing. In this case, the data included all the web pages on the NASA domain. This was done with a simple web-mirroring tool called Wget [6]. The second stage identifies all instances of employee names by programmatically examining each HTML file. The name data is taken from the X.500 personnel directory databases. All names in the employee database are organized beforehand into a map-like data structure that is used in the web content mining process. This map consists of all employee names referenced by their last-name key. In addition, each full name is stored in every possible form in which it could appear. For example, the name John A. Smith is stored as "John A. Smith"; "J. A. Smith"; "J. Smith"; "Smith, John A."; "Smith, J.A."; and "Smith J.". An individual document is first searched for all last-name keys. Subsequently, the document is searched again using all values of the matching keys. Name data organized in this way can increase the speed of the text search, whereas using one long sequence containing all names in every possible form as search criteria would slow down processing.
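The two-pass name search described above can be sketched in Python. This is an illustrative reading of the text rather than the Expert Seeker code. The function names are hypothetical, every employee is assumed to have a first name, a middle initial, and a last name, and the six stored forms are the ones listed for John A. Smith.

```python
import re
from collections import defaultdict


def name_variants(first, middle, last):
    """Return the six forms the text lists for a name such as John A. Smith."""
    f, m = first[0], middle[0]
    return [
        f"{first} {m}. {last}",   # John A. Smith
        f"{f}. {m}. {last}",      # J. A. Smith
        f"{f}. {last}",           # J. Smith
        f"{last}, {first} {m}.",  # Smith, John A.
        f"{last}, {f}.{m}.",      # Smith, J.A.
        f"{last} {f}.",           # Smith J.
    ]


def build_name_map(employees):
    """Map each last-name key to the full names and variants filed under it."""
    name_map = defaultdict(list)
    for first, middle, last in employees:
        full_name = f"{first} {middle[0]}. {last}"
        name_map[last].append((full_name, name_variants(first, middle, last)))
    return name_map


def find_employees(text, name_map):
    """Return the full names of employees mentioned in one document."""
    found = set()
    for last, entries in name_map.items():
        # Pass 1: cheap whole-word check for the last-name key.
        if not re.search(r"\b" + re.escape(last) + r"\b", text):
            continue
        # Pass 2: only for matching keys, try every stored form of the name.
        for full_name, variants in entries:
            if any(variant in text for variant in variants):
                found.add(full_name)
    return found
```

Most documents mention few employees, so the first pass discards most keys before any variant is tried, which is the speed benefit the text attributes to this organization of the name data.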
The third stage involves identifying keywords within the HTML content. This is done using a word frequency calculation. First the text is broken up into individual words through string pattern matching. Any sequence of alphabetical characters is recognized as a word, while punctuation, numbers, and white space characters are ignored. Each word in the resulting list is checked against a stoplist, and stop words are removed. The remaining words are then processed with a stemming algorithm. A stemmer is used to remove the suffix of a word. This is done to group together words that may be spelled differently but have the same semantic meaning. A person who types "astronomical" as a query term would most likely also be interested in documents that match the term "astronomy". Once the stemming process is completed, the fourth stage involves calculating the frequency of each term. Word frequency was used during the keyword selection process to determine good index terms. However, other indexing algorithms could have been used instead with comparable results. It is important to note that the degree of relation between an employee name and a keyword within an individual document is not considered. Rather, expertise is determined based on the assumption that if an employee recurrently appears in many documents along with a keyword, then that person must have some knowledge of that term. Theoretically, a large document count for a search query should produce more accurate results. The chosen keywords have a twofold purpose. First, they are used to quickly associate employees with recurring skill terms. Second, these keywords can be used in future work for clustering similar documents into topic areas. Further work includes taxonomy construction from these keywords and the development of a query relevance feedback system that suggests query terms related to the query entered by the user.
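The keyword stages and the document-count ranking described in this section can be sketched as follows. This is a simplified illustration, not the Expert Seeker implementation. The stoplist is a small sample, and the suffix-stripping `stem` function is a crude hypothetical stand-in for a real stemming algorithm, which the text does not name.

```python
import re
from collections import Counter

# Illustrative stoplist; a production stoplist would be much longer.
STOPLIST = {"a", "and", "in", "is", "of", "the", "there"}

# Crude stand-in for a real stemmer: strip the first matching suffix.
SUFFIXES = ("ical", "ing", "ies", "es", "y", "s")


def tokenize(text):
    """Treat any run of alphabetical characters as a word; ignore the rest."""
    return re.findall(r"[a-z]+", text.lower())


def stem(word):
    """Remove one known suffix, keeping at least three leading characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text):
    """Tokenize, drop stop words, stem, and count term frequencies."""
    words = [word for word in tokenize(text) if word not in STOPLIST]
    return Counter(stem(word) for word in words)


def rank_experts(query, documents):
    """Rank names by the number of matching documents they appear in.

    `documents` is a list of (text, names) pairs, where `names` is the set
    of employee names already identified in that document.
    """
    term = stem(query.lower())
    counts = Counter()
    for text, names in documents:
        if term in index_terms(text):
            counts.update(names)
    return counts.most_common()
```

With this toy stemmer, "astronomical" and "astronomy" both reduce to "astronom", so a query for either term matches documents containing the other. As in the text, the ranking counts only co-occurrence in a document and does not measure how closely a name and a keyword are related within it.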

8. Shortcomings of the use of Data Mining in Expertise-Locator KMS

Table 2 illustrates preliminary results for web mining accuracy and precision for a set of skill keyword query terms. Precision was determined by testing whether a keyword entered as a query term correctly describes the expertise of the corresponding employee. The precision values for the keywords in Table 2 are represented by the percentage of correct matches within the top 15 results for the keyword. Recall was not calculated because it would be hard to determine whether the names appearing in the NASA web documents completely reflected all the employees of that organization. The results show a high precision for scientific or research-related skill terms, and lower precision for the more managerial or administrative skill terms. This may be due to the highly scientific and research-oriented nature of the document body. These results show that the system can retrieve experts from the document body with a high degree of precision, in particular for scientific and research-related keywords.

[6] http://www.gnu.org/software/wget/wget.html

Keyword            Precision (top 15 results)
Astrophysics       87%
Astronomy          92%
Comet              92%
Climate            92%
Ocean              73%
Atmosphere         87%
Management         64%
Human resources    53%

Table 2- Precision Results for Sample Skill Keyword Query
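The precision measure described for Table 2, the share of correct matches among the top 15 results, can be written as a short Python function. This is a sketch under assumptions: the function name and the convention of dividing by the number of results actually returned (when fewer than 15 are available) are not taken from the source, which does not report raw counts.

```python
def precision_at_k(ranked_names, relevant_names, k=15):
    """Fraction of the top-k returned names judged to be true experts."""
    top = ranked_names[:k]
    if not top:
        return 0.0
    hits = sum(1 for name in top if name in relevant_names)
    return hits / len(top)
```

As a purely arithmetic illustration, 13 correct names among 15 returned would give 13/15, or about 87%.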

One of the shortcomings of this research is that the accuracy provided by web data mining depends on the existence of employee web pages and their proper maintenance. Employee web pages must encode some minimal required level of content, including papers or technical documents published, collaborators in the case of multi-author papers, and the competencies represented in those papers. Another shortcoming of this research is the possible existence of multiple employees with the same name. Expert Seeker removes all repeat instances of the same name. All information referenced to a particular employee name was indexed without any attempt to distinguish between one of possibly several persons with the same name. Instances of multiple employees with the same name were handled at the time the user queried the database. The results of the query are a list of hyperlinks referenced by employee names. When a user clicks a name, the human resources data of that employee is displayed. In the case of multiple entries in the database for a name, the user is taken to an intermediary web page where a more detailed description of each employee is displayed. The user can then determine which of the employees is most likely to be an expert in the subject area being queried. For example, if the query term is "astrophysics", an employee who works in an astrophysics laboratory is more likely to be an expert in that area than a person who works in an administrative office.

Another obstacle is the indexing of keywords that may not be relevant areas of specialization. It is important to note that the query search is performed on the results of an IR keyword indexing process rather than on a predetermined set of skill terms. The system is not designed to discriminate between keywords that describe a skill area and words that are good indexing terms but are not relevant for determining an area of expertise. However, we tend to think this is not a significant problem: if the system is designed to determine expertise in an area of specialization, then it is highly unlikely that a user will enter a query that does not describe a skill area relevant to the areas of specialization within the NASA organization. This point could be argued on the basis of a potential language or terminology barrier between user and data. The flexibility of the indexing process, however, does provide an advantage: a search can be performed using keywords that are not traditionally considered skill terms, such as project names and highly specific technical terms. This particular shortcoming could become a significant issue when pursuing the idea of constructing knowledge taxonomies by means of clustering techniques, because keywords that are not skill-related could be inadvertently placed in the taxonomy, creating irregularities within the knowledge hierarchy.
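The query-time handling of duplicate names described earlier in this section can be sketched as a small routing step. The record fields (`name`, `unit`) and the function names below are hypothetical and serve only to illustrate the decision between showing a profile and showing an intermediary page.

```python
from collections import defaultdict


def group_by_name(records):
    """Group human resources records by employee name."""
    by_name = defaultdict(list)
    for record in records:
        by_name[record["name"]].append(record)
    return by_name


def resolve_click(name, by_name):
    """Decide what to show when the user clicks a name in the result list."""
    matches = by_name.get(name, [])
    if len(matches) == 1:
        # A unique name leads straight to the employee's profile.
        return ("profile", matches[0])
    if len(matches) > 1:
        # Several employees share the name: show all of them so the user
        # can pick the one whose description best fits the query.
        return ("disambiguation", matches)
    return ("not_found", None)
```

In this sketch the choice among same-name employees is left to the user, as in the text; no automatic disambiguation is attempted.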

9. Concluding Remarks and Future Work

Future developments for expertise-locator systems such as SAGE and Expert Seeker include the development and integration of artificial intelligence (AI) technologies to enhance the capabilities of these systems. For example, data mining has been shown to enhance the process of updating profiles by mining the authors of documents in an electronic repository and identifying a correspondence with the topic of the document. Authors of documents in an electronic repository are experts in those knowledge areas; therefore, the profiles of the contributors to the repository could be automatically updated with keywords related to the subject matter of their contributions. This data mining effort would result in a diminished reliance on self-assessment. Web content mining can be useful and efficient for the construction of expertise-locator KMS in situations where a large text document body exists and contains relevant skill information. It provides a solution to some of the challenges faced in the development of these systems, including how to maintain up-to-date employee skill profiles while minimizing the need for cumbersome and possibly biased self-reporting. A prototype of Expert Seeker's data mining functionality was demonstrated to NASA representatives and received a positive assessment. The system is currently under further review for implementation and use within their organization.

Furthermore, a data mining effort could be instrumental in clustering similar data objects together. For example, the data in SAGE is organized by grant and indexed by the Principal Investigator field. Through the use of a clustering tool (Schurr et al., 1999), data can be grouped into clusters of expertise, to reveal expertise areas that may not be currently defined. In the case of Expert Seeker, grouping experts within KSC who have complementary expertise areas would result in virtual "centers of excellence". This effort could reveal areas of strength that could otherwise go unnoticed in the organization.

AI technologies are already having an important impact on the development of KMS such as People-Finder applications. Additional developments in this area will be instrumental in the development of organizational training programs that may be designed to address the gap between what "is known" and what "needs-to-be-known" in the organization.

In conclusion, our vision of People-Finder KMS fits well with the work to develop systems that seek to create an IT-support environment for knowledge workers. This is done through the use of intelligent assistants in a business process environment, keeping in mind that "an IT tool may only act as a facilitator for sharing, creating or retrieving knowledge, but never as a key player in creating, evaluating or contributing knowledge" (Tristram, 1998).

Acknowledgements

The author wishes to acknowledge NASA-Kennedy Space Center and NASA-Headquarters under the Faculty Awards for Research (FAR99), grant number NAG10-0259, as well as NASA-Goddard Space Flight Center and the CESDIS, contract number NAS5-32337 and subcontract number 5555-97-74, for financial support for the development of Expert Seeker. The author also wishes to acknowledge NASA-KSC and Bethune-Cookman College for financial support in the development of SAGE, under the auspices of the "Florida Minority Institution Entrepreneurial Partnership Grant", grant number NAGO-0220. Special thanks to all NASA employees who collaborated in this effort, including Mr. Gregg Buckingham, Mr. Chris Carlson, Mr. Steve Chance, Dr. Milt Halem, Dr. Susan Hoban, Mr. James Jennings, Ms. Nancy Laubenthal, Dr. Shannon Roberts, and Mr. Pat Simpkins. The author also wishes to acknowledge the contributions of the students who work in the FIU Knowledge Management Lab and who collaborated in this research, specifically Bertha Correa, Jaclyn Don, Rigoberto Fernandez, Alan Harrylal, Luis Felipe Villegas, Hector Hartmann, Jorge Lores, Lusally Mui, Thomas Pla, Maria Ray, Juan Rodriguez, and Hernan Santiesteban.

References

Ahonen, H., Heinonen, O., Klemettinen, M., and Verkamo, I. (1997). Applying Data Mining Techniques in Text Analysis. Technical Report C-1997-23, Dept. of Computer Science, University of Helsinki.

Alavi, M., and Leidner, D. (1999). Knowledge Management Systems: Issues, Challenges, and Benefits. Communications of the Association for Information Systems [online], 1. Available: http://cais.isworld.org/articles/default.asp?vol=1&art=7 (November 1999).

Becerra-Fernandez, I. (1998a). Corporate Memory Project, Final Report, NASA grant No. NAG10-0232, 12-25.

Becerra-Fernandez, I., and Aha, D. (1999a). Case-Based Problem Solving for Knowledge Management Systems. In Proceedings of the Twelfth Annual International Florida Artificial Intelligence Research Symposium (FLAIRS), Knowledge Management Track, Orlando, Florida.

Becerra-Fernandez, I. (1999b). Searchable Answer Generating Environment (SAGE): A Knowledge Management System for Searching for Experts in Florida. In Proceedings of the 12th Annual International Florida Artificial Intelligence Research Symposium (FLAIRS), Knowledge Management Track, Orlando, Florida (May 1999).

Becerra-Fernandez, I. (2000a). The Role of Artificial Intelligence Technologies in the Implementation of People-Finder Knowledge Management Systems. Knowledge Based Systems, special issue on Artificial Intelligence in Knowledge Management, Vol. 13, No. 5 (October 2000).

Becerra-Fernandez, I. (2000b). Facilitating the Online Search of Experts at NASA using Expert Seeker People-Finder. In Proceedings of the Third International Conference on Practical Aspects of Knowledge Management, Basel, Switzerland.

Becerra-Fernandez, I., and Stevenson, J.M. (2000). Knowledge Management Systems & Solutions for the School Principal as Chief Learning Officer. Forthcoming.

Cañas, A., Leake, D., and Wilson, D. (1999). Managing, Mapping, and Manipulating Conceptual Knowledge. In Proceedings of the AAAI-99 Workshop on Exploring Synergies of Knowledge Management and Case-Based Reasoning, 10-14. Menlo Park: AAAI Press.

Davenport, T. (1996). Knowledge Management at Hewlett Packard. Available online at: http://www.bus.utexas.edu/kman/hpcase.htm

Davenport, T. (1997). Knowledge Management Case Study: Knowledge Management at Microsoft. Available online at: http://kman.bus.utexas.edu/kman/microsoft.htm

Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., and Uthurusamy, R., eds. (1996). Advances in Knowledge Discovery and Data Mining. AAAI Press.

Frakes, W., and Baeza-Yates, R. (1992). Information Retrieval: Data Structures and Algorithms. Upper Saddle, NJ: Prentice Hall.

Nissen, M.E. (2000). Knowledge Based Knowledge Management in the Reengineering Domain. Forthcoming.

Schurr, H., Staab, S., and Studer, R. (1999). Ontology-based Process Support. In Proceedings of the AAAI Workshop on Exploring Synergies of Knowledge Management and Case-Based Reasoning, Orlando, Florida (July 1999).

Tristram, C. (1998). Common Knowledge. CIO. Available online at www.cio.com/archive/webbusiness/090198_booz.html
