Keystroke Dynamics - Benchmark Data Set
Accompaniment to "Comparing Anomaly-Detection Algorithms for Keystroke Dynamics" (DSN-2009)
Authors: Kevin Killourhy and Roy Maxion
This webpage provides a benchmark data set for keystroke dynamics. It is a supplement to the paper "Comparing Anomaly-Detection Algorithms for Keystroke Dynamics," by Kevin Killourhy and Roy Maxion, published in the proceedings of the DSN-2009 conference [1]. The webpage is organized as follows:
- 1. Introduction: About this webpage
- 2. The Data: Timing data for 51 typists
- 3. Evaluation Script: Script for evaluating 3 anomaly detectors
- 4. Table of Results: Error-rate results for the detectors
- 5. References: Relevant material and acknowledgments
Sections 1–4 each consist of a brief explanation of their contents, followed by a list of common questions that provide more detail about the material.
1. Introduction
On this webpage, we share the data, scripts, and results of our evaluation so that other researchers can use the data, reproduce our results, and extend them; or, use the data for investigations of related topics, such as intrusion, masquerader or insider detection. We hope these resources will be useful to the research community.
Common Questions
- Q1-1: What is keystroke dynamics (or keystroke biometrics)?
Keystroke dynamics is the study of whether people can be distinguished by their typing rhythms, much like handwriting is used to identify the author of a written text. Possible applications include acting as an electronic fingerprint, or in an access-control mechanism. A digital fingerprint would tie a person to a computer-based crime in the same manner that a physical fingerprint ties a person to the scene of a physical crime. Access control could incorporate keystroke dynamics both by requiring a legitimate user to type a password with the correct rhythm, and by continually authenticating that user while they type on the keyboard.
- Q1-2: What is your paper about? What is this webpage for?
To make measurable progress in the field of keystroke dynamics, shared data and shared evaluation methods are necessary. In our paper, we describe a methodology by which timing data were collected and used to evaluate 14 anomaly detectors (e.g., detectors based on support vector machines, neural networks, and fuzzy logic). The anomaly-detection task was to discriminate between the typing of a genuine user trying to gain legitimate access to his or her account, and the typing of an impostor trying to gain access illegitimately to that same account. Our intent with this webpage is to share our resources—the typing data, the evaluation script, and the table of results—with the research community, and to answer questions that they (or you) might have.
- Q1-3: Where can I find a copy of the paper?
The paper is included in the Proceedings of the 39th Annual International Conference on Dependable Systems and Networks (DSN-2009), published by IEEE Press [1].
- Q1-4: How would I cite this webpage in a publication?
This webpage is a supplement to our original paper in DSN-2009, and the paper refers readers to this website. Consequently, we ask that authors who find this web resource useful provide a citation to the original paper [1]. See the citation in Section 5, References, at the end of this webpage.
2. The Data
The data consist of keystroke-timing information from 51 subjects (typists), each typing a password (.tie5Roanl) 400 times.
Common Questions
- Q2-1: How were the data collected?
For complete details of our data-collection methodology, we refer readers to our original paper [1]. A brief summary of our methodology follows.
We built a keystroke data-collection apparatus consisting of: (1) a laptop running Windows XP; (2) a software application for presenting stimuli to the subjects and for recording their keystrokes; and (3) an external reference timer for timestamping those keystrokes. The software presents the subject with the password to be typed. As the subject types, the password is checked for correctness. If the subject makes a typographical error, the application prompts the subject to retype the password. In this manner, we record timestamps for 50 correctly typed passwords in each session.
Whenever the subject presses or releases a key, the software application records the event (i.e., keydown or keyup), the name of the key involved, and a timestamp for the moment at which the keystroke event occurred. An external reference clock was used to generate highly accurate timestamps. The reference clock was demonstrated to be accurate to within ±200 microseconds (by using a function generator to simulate key presses at fixed intervals).
We recruited 51 subjects (typists) from within a university community; all subjects fully completed the study—we did not drop any subjects. All subjects typed the same password, and each subject typed the password 400 times over 8 sessions (50 repetitions per session). They waited at least one day between sessions, to capture some of the day-to-day variation of each subject's typing. The password (.tie5Roanl) was chosen to be representative of a strong 10-character password.
The raw records of all the subjects' keystrokes and timestamps were analyzed to create a password-timing table. The password-timing table encodes the timing features for each of the 400 passwords that each subject typed.
- Q2-2: How do I read the data into R / Matlab / Weka / Excel / ...?
The data are provided in three different formats to make it easier for researchers visiting this page to view and manipulate the data. In every format, the data are organized into a table, but different applications are better suited to different formats.
(R): In the fixed-width format, the columns of the table are separated by one or more spaces so that the information in each column is aligned vertically. This format is easy to read in a standard web browser or document editor with a fixed-width font. It can also be read by the standard data-input mechanisms of the statistical-programming environment R. Specifically, the read.table command can be used to read the data into a structure called a data.frame:

X <- read.table('DSL-StrongPasswordData.txt', header = TRUE)
(CSV): In the comma-separated-value format, the columns of the table are separated by commas. This format can be read by many data-analysis tools, including Matlab, Weka, and spreadsheet applications.
(Excel): In the Microsoft Excel binary-file format, the columns of the table are encoded as a standard Excel spreadsheet. This format can be used by researchers wishing to bring Excel's data-analysis and graphing capabilities to bear on the data.
By making the data available in these three formats, we hope to make it easier for other researchers to use their preferred data-analysis tools. In our own research, we use the fixed-width format and the R statistical-programming environment.
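As a quick sanity check after loading the fixed-width file in R, the structure of the resulting data.frame can be inspected. The following is a minimal sketch; the expected dimensions follow from 51 subjects, 400 repetitions each, and 34 columns:

X <- read.table('DSL-StrongPasswordData.txt', header = TRUE)
dim(X)                   # 20400 rows (51 subjects x 400 repetitions) and 34 columns
names(X)[1:6]            # "subject" "sessionIndex" "rep" "H.period" "DD.period.t" "UD.period.t"
range(table(X$subject))  # every subject appears exactly 400 times, so both values are 400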
- Q2-3: How are the data structured? What do the column names mean? (And why aren't the subject IDs consecutive?)
The data are arranged as a table with 34 columns. Each row of data corresponds to the timing information for a single repetition of the password by a single subject. The first column, subject, is a unique identifier for each subject (e.g., s002 or s057). Even though the data set contains 51 subjects, the identifiers do not range from s001 to s051; subjects have been assigned unique IDs across a range of keystroke experiments, and not every subject participated in every experiment. For instance, Subject 1 did not perform the password-typing task, and so s001 does not appear in the data set. The second column, sessionIndex, is the session in which the password was typed (ranging from 1 to 8). The third column, rep, is the repetition of the password within the session (ranging from 1 to 50).
The remaining 31 columns present the timing information for the password. The name of each column encodes the type of timing information. Column names of the form H.key designate a hold time for the named key (i.e., the time from when the key was pressed to when it was released). Column names of the form DD.key1.key2 designate a keydown-keydown time for the named digraph (i.e., the time from when key1 was pressed to when key2 was pressed). Column names of the form UD.key1.key2 designate a keyup-keydown time for the named digraph (i.e., the time from when key1 was released to when key2 was pressed). Note that UD times can be negative, and that H times and UD times add up to DD times.
Consider the following one-line example of what you will see in the data:

subject sessionIndex rep H.period DD.period.t UD.period.t ...
s002    1            1   0.1491   0.3979      0.2488      ...

In this example, subject s002, on the first repetition of the first session, held the period key down for 0.1491 seconds (149.1 milliseconds); the time between pressing the period key and the t key (keydown-keydown time) was 0.3979 seconds; the time between releasing the period key and pressing the t key (keyup-keydown time) was 0.2488 seconds; and so on.
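The relationship between the H, UD, and DD columns can be checked directly in R. The following is a minimal sketch, using the first digraph (period-to-t) as an example:

X <- read.table('DSL-StrongPasswordData.txt', header = TRUE)
# For each row, the keydown-keydown time should equal the hold time of the
# first key plus the keyup-keydown time of the digraph.
residual <- X$DD.period.t - (X$H.period + X$UD.period.t)
summary(residual)   # should be zero (up to rounding) for every row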
3. Evaluation Scripts
The evaluation script (evaluation-script.R), written in the R language for statistical computing (www.r-project.org), demonstrates how to use the data to evaluate three anomaly detectors (called Euclidean, Manhattan, and Mahalanobis).
Note that this script depends on the R package ROCR for generating ROC curves [2].
Common Questions
- Q3-1: What does the script really do? Can you explain the steps of the evaluation?
For complete details of our evaluation methodology, and a clear explanation of our design decisions, we refer readers to our original paper [1]. A brief summary of our evaluation methodology follows.
The following four steps are used to evaluate a single anomaly detector on the task of discriminating a single subject (designated as the genuine user) from the other 50 subjects (designated as the impostors). After evaluating the detector for a single subject, these four steps will be repeated for each subject in the data set, so that each subject, in turn, will have been "attacked" by each of the other 50 subjects in a balanced experimental design.
Step 1 (training): Retrieve the first 200 passwords typed by the genuine user from the password-timing table. Use the anomaly detector's training function with these password-typing times to build a detection model for the user's typing.
Step 2 (genuine-user testing): Retrieve the last 200 passwords typed by the genuine user from the password-timing table. Use the anomaly detector's scoring function and the detection model (from Step 1) to generate anomaly scores for these password-typing times. Record these anomaly scores as user scores.
Step 3 (impostor testing): Retrieve the first 5 passwords typed by each of the 50 impostors (i.e., all subjects other than the genuine user) from the password-timing table. Use the anomaly detector's scoring function and the detection model (from Step 1) to generate anomaly scores for these password-typing times. Record these anomaly scores as impostor scores.
Step 4 (assessing performance): Employ the user scores and impostor scores to generate an ROC curve for the genuine user. Calculate, from the ROC curve, an equal-error rate, that is, the error rate corresponding to the point on the curve where the false-alarm (false-positive) rate and the miss (false-negative) rate are equal.
Repeat the above four steps, designating each of the subjects as the genuine user in turn, and calculating the equal-error rate for that genuine user. Calculate the mean of all 51 subjects' equal-error rates as a measure of the detector's performance, and calculate the standard deviation as a measure of its variability across subjects.
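For concreteness, the following is a minimal R sketch of these four steps for the Euclidean detector. It is an illustration only, not the evaluation-script.R distributed on this page; the helper names (euclideanEval, featureCols) and the nearest-crossing approximation of the equal-error rate are our own choices.

library(ROCR)   # used to trace the ROC curve [2]

X <- read.table('DSL-StrongPasswordData.txt', header = TRUE)
featureCols <- 4:34   # the 31 timing columns

euclideanEval <- function(X, user) {
  genuine   <- X[X$subject == user, featureCols]
  others    <- X[X$subject != user, ]
  impostors <- do.call(rbind, lapply(split(others, others$subject, drop = TRUE),
                                     function(d) d[1:5, featureCols]))   # Step 3 data

  center <- colMeans(as.matrix(genuine[1:200, ]))          # Step 1: "model" = mean vector
  score  <- function(m) sqrt(rowSums(sweep(as.matrix(m), 2, center)^2))
  userScores     <- score(genuine[201:400, ])              # Step 2
  impostorScores <- score(impostors)                       # Step 3

  # Step 4: equal-error rate from the ROC curve (impostor = positive class)
  pred <- prediction(c(userScores, impostorScores),
                     c(rep(0, length(userScores)), rep(1, length(impostorScores))))
  far  <- performance(pred, 'fpr')@y.values[[1]]           # false-alarm rates
  frr  <- performance(pred, 'fnr')@y.values[[1]]           # miss rates
  i <- which.min(abs(far - frr))
  mean(c(far[i], frr[i]))                                  # approximate equal-error rate
}

eers <- sapply(unique(as.character(X$subject)), function(s) euclideanEval(X, s))
c(eer.mean = mean(eers), eer.sd = sd(eers))   # compare with the Euclidean results reported above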
- Q3-2: How do I download R / install packages / run the script?
You can download R from the webpage for the R Project for Statistical Computing (http://www.r-project.org). The R statistical-programming environment is a general programming language with many functions and packages for conducting a range of statistical analyses and data visualizations. It is available for most modern operating systems, and it is free and open-source. We developed and tested our evaluation script with R version 2.6.2, but we expect that it will work with similar versions of R. If you are not familiar with R, there are many tutorials and references available online. The following is a collection of some that we have used, or that have been recommended to us:
- Introduction to the Statistical Language R (by Myron Hylinka)
- A Skimpy Intro to R/S/S-Plus (by Thomas Fletcher)
- A Brief History of S (by Richard Becker)
- R Tutorial (by Kelly Black)
- An Introduction to R (by the R Development Core Team)
- R Language Definition (by the R Development Core Team)
Once you have installed R and have become familiar with how to use it, the next step is to install an additional package (called ROCR) that is necessary for running our evaluation. The R project maintainers have organized a large collection of packages, and have made it easy to download and install these packages. The necessary package can be installed with a single R command:
install.packages('ROCR')
The final step is to download the evaluation script from this webpage and place it in the same directory as the data in fixed-width format. To run the script, use the R command source:

source('evaluation-script.R')
If you have installed R correctly, installed the appropriate packages, and run the evaluation script successfully, it should print information with which you can monitor the progress of the evaluation. Eventually, it should tally and print the following results for the three anomaly detectors:

              eer.mean  eer.sd
Euclidean        0.171   0.095
Manhattan        0.153   0.092
Mahalanobis      0.110   0.065
Note that these results are fractional rates between 0.0 and 1.0 (not percentages between 0% and 100%). They match the average equal-error rates and standard deviations for the detectors from Table 2 of our original paper (which are also reproduced in the table of results, below). By running this script successfully, you will have replicated our evaluation methodology and reproduced our results for these three detectors.
- Q3-3: Why does the script only have code for three anomaly detectors?
The purpose of this webpage is to share the data and the evaluation methodology that were the original contributions of our paper, not to provide and support code for all 14 anomaly detectors. In our original paper, we describe each of the 14 detectors, and we provide references to the original sources. We encourage researchers who are interested in replicating those detectors to use that material.
We implemented these three anomaly detectors because they are good examples with which to demonstrate our evaluation methodology. They are relatively easy to understand, since they are based on classical measures of distance from the statistical machine-learning and pattern-recognition literature. They are easy to implement, since they do not depend on packages or algorithms not found in a typical R installation. Finally, they are easy to run, since they do not require complex optimizations in order to run efficiently.
Note that—in the interest of scientific progress—we see a benefit in maintaining reference implementations of the top-performing detectors. Such reference implementations could be used, evaluated, and improved by the whole community. We are investigating the feasibility of sharing and supporting such reference implementations in the future (and wholly encourage others to do so as well), but we are not able to do so at the present time.
- Q3-4: What other kinds of anomaly detectors can be evaluated using these scripts?
Each of the anomaly detectors in our comparison comprises two functions: a training function and a scoring function. The training function takes a matrix of password-timing information as input, and it outputs a detection model. Each row of the input matrix encodes password-timing information from one repetition of the genuine user typing the password. The function uses this set of timing information to build a model of that user's typing. The details of the model are detector-specific, and the model need not take a particular form for our evaluation.
The scoring function takes the detection model produced by the training function and another matrix of password-timing information as input. It outputs a set of anomaly scores. The scoring function compares the timing information from each password in the matrix to the genuine user's typing model. For each password, it calculates an anomaly score, indicating the degree to which that new sample is dissimilar from the typing model. A higher anomaly score means greater dissimilarity according to that anomaly detector's conception of similarity.
The commonality across all the anomaly detectors is that each one can be implemented as a training and a scoring function. Any other anomaly detector that can be implemented as such a pair of functions, with the same types of input and output, can be evaluated using our methodology. If a new detector were implemented in R as the functions newTrain and newScore, it could be evaluated simply by adding these two functions to the detectorSet list of detectors:

detectorSet = list( NewDetector = list( train = newTrain, score = newScore ) );
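For illustration, a minimal sketch of such a pair of functions is given below. It implements a simple scaled-Manhattan-style detector; the function names and the argument order (model first, test matrix second, following the description above) are our assumptions and may need adjusting to match the conventions of the distributed script.

# Hypothetical training function: the "model" is the mean and the mean absolute
# deviation of each timing feature over the genuine user's training matrix.
newTrain <- function(trainMatrix) {
  center <- colMeans(trainMatrix)
  spread <- colMeans(abs(sweep(trainMatrix, 2, center)))
  list(center = center, spread = pmax(spread, 1e-6))   # guard against division by zero
}

# Hypothetical scoring function: the anomaly score of each test password is the
# sum of its per-feature absolute deviations from the model, scaled by the spread.
newScore <- function(model, testMatrix) {
  deviations <- abs(sweep(testMatrix, 2, model$center))
  rowSums(sweep(deviations, 2, model$spread, FUN = '/'))
}

If the distributed script's calling convention matches, adding the detectorSet entry above would then evaluate this detector alongside the three provided ones.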
Our intent in sharing the data is for the password-timing tables to be used to evaluate a range of anomaly detectors, so that the results of the evaluations can be soundly compared, using the same data and the same evaluation procedure. Consequently, we encourage other researchers to use our evaluation script to evaluate new and better anomaly-detection strategies for keystroke dynamics.
- Q3-5: What if I want to do a different evaluation using the data?
The data, with or without the evaluation methodology, are intended to be a shared resource for public use. Consequently, researchers who would like to do different evaluations using the data are welcome to do so. For instance, while our study focused on anomaly detectors, other researchers have considered binary and multi-class classifiers for keystroke dynamics: a binary classifier might be trained to discriminate between two typists, or between one typist and a pool of typing data from many other typists; a multi-class classifier might be trained to identify which of several typists entered a particular typing sample. The data shared on this website could be used to evaluate any of these alternative families of learning algorithms.
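As one illustration of such an alternative analysis (not part of our evaluation), the sketch below trains a linear discriminant classifier to distinguish two typists. The choice of subjects, the train/test split, and the use of MASS::lda are arbitrary, and the DD columns are dropped because they are linear combinations of the H and UD columns.

library(MASS)   # for lda()

X <- read.table('DSL-StrongPasswordData.txt', header = TRUE)
pair <- subset(X, subject %in% c('s002', 's003'))                 # two arbitrary typists
keep <- c('subject', grep('^(H|UD)\\.', names(X), value = TRUE))  # keep H and UD features only

train <- pair[pair$rep <= 40, keep]   # first 40 repetitions of each session
test  <- pair[pair$rep >  40, keep]   # last 10 repetitions of each session

fit <- lda(subject ~ ., data = train)
mean(predict(fit, test)$class == test$subject)   # classification accuracy on held-out repetitions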
Caution: we have one request of researchers who use these data with a different evaluation methodology. Please make it very clear that your methodology differs from the one in our paper, and clearly describe your alternative methodology. We have observed that algorithms evaluated under different conditions are often compared, even though the differing evaluation environments represent a serious potential confound. By clearly explaining your evaluation methodology, and how it differs from others, you reduce the risk that a reader will conflate the different methodologies and make unsound comparisons.
4. Table of Results
The following table ranks the 14 anomaly detectors by their average equal-error rates. The evaluation procedure described above was used to obtain the equal-error rates for each anomaly detector. For example, the average equal-error rate for the scaled Manhattan detector (across all subjects) was 0.0962 (9.62%), with a standard deviation of 0.0694.
| Detector | Average Equal-Error Rate (stddev) |
|---|---|
| Manhattan (scaled) | 0.0962 (0.0694) |
| Nearest Neighbor (Mahalanobis) | 0.0996 (0.0642) |
| Outlier Count (z-score) | 0.1022 (0.0767) |
| SVM (one-class) | 0.1025 (0.0650) |
| Mahalanobis | 0.1101 (0.0645) |
| Mahalanobis (normed) | 0.1101 (0.0645) |
| Manhattan (filter) | 0.1360 (0.0828) |
| Manhattan | 0.1529 (0.0925) |
| Neural Network (auto-assoc) | 0.1614 (0.0797) |
| Euclidean | 0.1706 (0.0952) |
| Euclidean (normed) | 0.2153 (0.1187) |
| Fuzzy Logic | 0.2213 (0.1051) |
| k Means | 0.3722 (0.1391) |
| Neural Network (standard) | 0.8283 (0.1483) |
Note that these results are fractional rates between 0.0 and 1.0 (not percentages between 0% and 100%).
Common Questions
- Q4-1: How do I interpret this table of results?
The first column indicates the name of the detector. All of the detectors in this list bear the names they were given in our original paper. The names are meant to describe the mathematical or statistical technique that underlies the anomaly-detection strategy. The reader interested in how each detector works will find additional detail in the prose and references of the original paper.
The second column provides the average equal-error rate of each detector, as estimated by our evaluation methodology. The standard deviation appears in parentheses. The detectors are sorted from lowest to highest average equal-error rate.
Note that these results and rankings are only the observed results of a single evaluation on a single data set. We would discourage a reader from inferring that the top-ranked detectors will necessarily outperform the other detectors in every setting. We believe it is likely that many factors—who the subjects are, what they type, and specifically how the data are collected and analyzed—affect the error rates of anomaly detectors used for keystroke dynamics. Variations in these factors might change a detector's equal-error rate, and might cause a different set of detectors to be among the top performers. A high rank in this table suggests that a detector is promising, but more data and more evaluations will be needed to determine how various factors affect keystroke-dynamics error rates. This topic is a subject of our current and ongoing research.
- Q4-2: Why do you use the average equal-error rate as the sole measure of performance?
Summarizing the performance of an anomaly detector as a single number is tricky; any choice of measure makes some concessions or has some drawbacks. As such, researchers in the field have used a variety of measures. In the original paper, we reported both the equal-error rate and the zero-miss false-alarm rate, since both are used in the literature. On this webpage, we tabulate the equal-error rate of each detector because it is a common measure of performance for many biometric systems. If other, demonstrably better measures of performance emerge, we will consider the feasibility of calculating them on our evaluation data and updating this page with those measures.
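For readers who want to compute both measures from a detector's output, the following is a minimal sketch (our own helper, not code from the paper); it takes a vector of genuine-user anomaly scores and a vector of impostor anomaly scores, with higher scores meaning more anomalous.

errorRates <- function(userScores, impostorScores) {
  thresholds <- sort(unique(c(userScores, impostorScores)))
  # A sample is flagged as an impostor when its score is at or above the threshold.
  far  <- sapply(thresholds, function(t) mean(userScores     >= t))   # false-alarm rate
  miss <- sapply(thresholds, function(t) mean(impostorScores <  t))   # miss rate
  i <- which.min(abs(far - miss))
  c(equal.error.rate      = mean(c(far[i], miss[i])),
    zero.miss.false.alarm = min(far[miss == 0]))
}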
- Q4-3: Do you plan to update the table with new results?
It has been suggested that we maintain a "scoreboard" of the latest and best results. Insofar as we are informed of the results obtained by other investigators, we may do so. We intend to assemble and maintain a list of research projects that use and extend our results. Researchers might use this reference to compare and build upon each other's work.
The data and evaluation procedure are freely available for use. We do ask, as a courtesy, that you let us know if you publish results based on our data.
5. References
- [1] Kevin S. Killourhy and Roy A. Maxion. "Comparing Anomaly-Detection Algorithms for Keystroke Dynamics," in Proceedings of the 39th Annual International Conference on Dependable Systems and Networks (DSN-2009), pages 125-134, Estoril, Lisbon, Portugal, June 29-July 2, 2009. IEEE Computer Society Press, Los Alamitos, California, 2009.
- [2] T. Sing, O. Sander, N. Beerenwinkel, and T. Lengauer. "ROCR: Visualizing classifier performance in R," Bioinformatics, 21(20):3940-3941, 2005.