RapidMiner HHHPlugin
The RapidMiner HHHPlugin contains Java implementations of the Hierarchical Heavy Hitter stream algorithms as described in (Cormode et al., 2008).
Download
HHHPlugin.zip (12M)
Note that the plugin was only tested with RapidMiner version 5.0.008 and that it won't run or compile with any earlier version!
The archive contains a precompiled version of the plugin. You need to copy the file rapidminer-Hierarchical Heavy Hitters-5.0.003.jar into RapidMiner's lib/plugins folder and restart RapidMiner.
To compile and install the plugin yourself, import the extracted folder into Eclipse and run the Ant build script. You may have to specify the location of your RapidMiner installation in the build.xml file.
If you have any questions or problems, please contact the plugin maintainer Marco Stolpe.
Development status
Originally developed by Peter Fricke as part of his diploma thesis, the plugin was ported to RapidMiner 5 and is now maintained by Marco Stolpe. Currently, the plugin has the following features:
- FullAncestry and PartialAncestry from the hitters.* packages can easily be used in your own code. The algorithms are independent of any particular domain.
- The plugin contains a parser for 54 different types of system calls in strace log files.
- A feature extractor can translate system call data into 3-dimensional hierarchical variables. For the call types, the translation is based on a hand-crafted taxonomy, which leads to a feature-rich representation containing information about the functional category, the name of a call, parameter types and values.
- The main operator, Extract HHHs plain, wraps the HHH algorithms and can approximate the most frequent combinations of call types, accessed file system paths and recent call history.
- The internal data structures (prefixes and their frequencies) of log files can be stored in an ExampleSet and compared to each other by several distance measures for sets of hierarchical elements.
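To make the notion of a hierarchical heavy hitter more concrete, here is a minimal sketch in Java. It is not the plugin's code and does not use its API: the class ExactHhhSketch and its methods are hypothetical, it works offline and exactly rather than on a stream, and it handles only one dimension (normalized absolute file system paths without trailing slashes). Following the definition in Cormode et al., a prefix is reported if its count, after discounting everything already covered by more specific heavy hitters, reaches a fraction phi of all items.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: exact, offline, one-dimensional. Not the plugin's API.
class ExactHhhSketch {

    // Parent of "/usr/lib" is "/usr"; parent of "/usr" is "/"; "/" has no parent.
    static String parent(String path) {
        if (path.equals("/")) return null;
        int i = path.lastIndexOf('/');
        return i == 0 ? "/" : path.substring(0, i);
    }

    // Number of path components below the root "/".
    static int depth(String path) {
        if (path.equals("/")) return 0;
        int d = 0;
        for (int i = 0; i < path.length(); i++) {
            if (path.charAt(i) == '/') d++;
        }
        return d;
    }

    // Returns each hierarchical heavy hitter with its discounted count.
    static Map<String, Long> hhh(List<String> paths, double phi) {
        double threshold = phi * paths.size();
        // Discounted counts, grouped by depth in the path hierarchy.
        Map<Integer, Map<String, Long>> levels = new HashMap<>();
        int maxDepth = 0;
        for (String p : paths) {
            int d = depth(p);
            maxDepth = Math.max(maxDepth, d);
            levels.computeIfAbsent(d, k -> new HashMap<>()).merge(p, 1L, Long::sum);
        }
        Map<String, Long> result = new LinkedHashMap<>();
        // Walk from the most specific level up to the root.
        for (int d = maxDepth; d >= 0; d--) {
            Map<String, Long> level = levels.getOrDefault(d, Collections.emptyMap());
            for (Map.Entry<String, Long> e : level.entrySet()) {
                if (e.getValue() >= threshold) {
                    // Heavy enough: report it and do not pass its count upward.
                    result.put(e.getKey(), e.getValue());
                } else if (d > 0) {
                    // Too light: hand the count to the parent prefix.
                    levels.computeIfAbsent(d - 1, k -> new HashMap<>())
                          .merge(parent(e.getKey()), e.getValue(), Long::sum);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> paths = Arrays.asList(
            "/usr/lib/a", "/usr/lib/a", "/usr/lib/a", "/usr/lib/b",
            "/usr/bin/c", "/etc/passwd", "/etc/hosts", "/home/x");
        System.out.println(hhh(paths, 0.25));
    }
}
```

With these eight made-up paths and phi = 0.25, /usr/lib/a is reported with count 3, while /usr is reported with a discounted count of 2 rather than 5, because the three accesses to /usr/lib/a are already accounted for by the more specific heavy hitter. The stream algorithms in the plugin approximate such counts in bounded memory and over several dimensions at once.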
A more domain-independent HHH operator is planned for release in the near future. We are also working on English documentation to supplement the existing German comments in the code.
Documentation and literature
There is no written documentation yet. To get started, see the example experiment and data sets below. You can also read about the plugin, experiments, data sets and algorithms in the following publications:
- Fricke, P. and M. Stolpe: Implementing Hierarchical Heavy Hitters in RapidMiner: Solutions and Open Questions. In: Proceedings of the RapidMiner Community Meeting and Conference (RCOMM 2010), 2010.
- Fricke, P., F. Jungermann, K. Morik, N. Piatkowski and O. Spinczyk: Towards Adjusting Mobile Devices to User's Behaviour. In: Proceedings of the International Workshop on Mining Ubiquitous and Social Environments (MUSE 2010), pp. 7--22, 2010.
- Fricke, P.: Datenaggregation von Betriebssystemdaten durch Hierarchical Heavy Hitters. TU Dortmund, LS 8, Diplomarbeit, 2010.
- Cormode, G., F. Korn, S. Muthukrishnan, D. Srivastava: Finding Hierarchical Heavy Hitters in Streaming Data. In: ACM Trans. Knowl. Discov. Data (1):4, pp. 1-48, 2008.
- Cormode, G., F. Korn, S. Muthukrishnan, D. Srivastava: Diamond in the Rough: Finding Hierarchical Heavy Hitters in Multi-dimensional Data. In: Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, pp. 155-166, 2004.
- Ganesan, P., H. Garcia-Molina, J. Widom: Exploiting Hierarchical Domain Structure to Compute Similarity. In: ACM Transactions on Information Systems (21):1, pp. 64-93, 2003.
Data sets and example process
Data of strace logs for eleven Linux applications: Download (101M)
Each file split into 30 parts: Download (103M)
To process the data in RapidMiner, use the accompanying .csv files, which contain references to the log files:
allsplit30.csv
firstlogs.csv
Note that you have to change all file path prefixes in these files to the corresponding paths in your own environment.
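If you prefer not to edit the files by hand, the prefix change can be scripted. The following is a minimal sketch only: the class PathPrefixRewriter is hypothetical, the layout of the .csv files is not assumed beyond "each line contains paths as plain text", and the file is assumed to be UTF-8 encoded. It simply replaces every occurrence of the old prefix with the new one.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Illustrative helper, not part of the plugin.
class PathPrefixRewriter {

    // Replaces every occurrence of oldPrefix in one line.
    static String rewriteLine(String line, String oldPrefix, String newPrefix) {
        return line.replace(oldPrefix, newPrefix);
    }

    // Reads 'in' line by line and writes the rewritten lines to 'out'.
    static void rewriteFile(Path in, Path out, String oldPrefix, String newPrefix)
            throws IOException {
        List<String> rewritten = new ArrayList<>();
        for (String line : Files.readAllLines(in, StandardCharsets.UTF_8)) {
            rewritten.add(rewriteLine(line, oldPrefix, newPrefix));
        }
        Files.write(out, rewritten, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 4) {
            System.err.println("usage: PathPrefixRewriter <in.csv> <out.csv> <oldPrefix> <newPrefix>");
            return;
        }
        rewriteFile(Paths.get(args[0]), Paths.get(args[1]), args[2], args[3]);
    }
}
```

Writing to a separate output file keeps the downloaded original intact, so a wrong prefix can be corrected by simply running the helper again.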
To create new strace data on Linux, you should be able to run the following command, where pid is the ID of the process to trace and logfile is the file the trace is written to:
strace -f -p pid -o logfile
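Log files written this way typically contain one system call per line, prefixed with the process ID, for example: 1234 open("/etc/passwd", O_RDONLY) = 3. As a rough illustration of what a parser has to extract, here is a minimal Java sketch. It is not the plugin's parser: the class StraceLineSketch and its regular expression are assumptions made for this example, and lines that do not describe a complete call (signals, exit notices, unfinished or resumed calls) are simply skipped.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only; the plugin ships its own parser.
class StraceLineSketch {

    // Optional pid, call name, raw argument string, first token of the return value.
    private static final Pattern CALL =
        Pattern.compile("^(?:(\\d+)\\s+)?(\\w+)\\((.*)\\)\\s*=\\s*(\\S+).*$");

    static final class Call {
        final String pid;     // null if the line carries no pid prefix
        final String name;
        final String args;
        final String result;

        Call(String pid, String name, String args, String result) {
            this.pid = pid;
            this.name = name;
            this.args = args;
            this.result = result;
        }
    }

    // Returns the parsed call, or null if the line is not a complete call.
    static Call parse(String line) {
        Matcher m = CALL.matcher(line);
        if (!m.matches()) {
            return null;
        }
        return new Call(m.group(1), m.group(2), m.group(3), m.group(4));
    }

    public static void main(String[] args) {
        Call c = parse("1234  open(\"/etc/passwd\", O_RDONLY) = 3");
        System.out.println(c.pid + " " + c.name + " -> " + c.result);
    }
}
```

The argument string is kept unparsed here; splitting it into typed parameters, as the plugin's feature extractor does for its taxonomy, would need per-call handling.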
You may use the following example experiment to test the Hierarchical Heavy Hitter operator:
hhh_classification.xml