Data stream mining (also known as stream learning) is the process of extracting knowledge structures from continuous, rapidly arriving data records. A data stream is an ordered sequence of instances that, in many applications of data stream mining, can be read only once or a small number of times using limited computing and storage capabilities.
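The single-pass, bounded-memory constraint can be illustrated with a running summary that is updated once per instance and never stores the stream itself. The following is a minimal sketch using Welford's online algorithm for mean and variance; the class name `RunningStats` and the sample readings are illustrative assumptions, not part of any particular stream mining library.

```python
class RunningStats:
    """Single-pass mean and variance (Welford's algorithm) in O(1) memory."""

    def __init__(self):
        self.n = 0        # number of instances seen so far
        self.mean = 0.0   # running mean
        self._m2 = 0.0    # running sum of squared deviations from the mean

    def update(self, x):
        """Consume one instance; the instance itself is not stored."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Population variance of the instances seen so far."""
        return self._m2 / self.n if self.n else 0.0


stats = RunningStats()
# A short list stands in for an unbounded stream (assumed sample values).
for reading in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(reading)

print(stats.n, stats.mean, stats.variance)
```

Each instance is read exactly once, and the memory used does not grow with the length of the stream.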
In many data stream mining applications, the goal is to predict the class or value of new instances in the data stream given some knowledge about the class membership or values of previous instances. Machine learning techniques can be used to learn this prediction task from labeled examples in an automated fashion. Often, concepts from the field of incremental learning are applied to cope with structural changes, online learning, and real-time demands. In many applications, especially those operating in non-stationary environments, the distribution underlying the instances or the rules underlying their labeling may change over time; that is, the goal of the prediction (the class or target value to be predicted) may change over time. This problem is referred to as concept drift. Detecting concept drift is a central issue in data stream mining. Other challenges that arise when applying machine learning to streaming data include partially labeled and delayed-label data, recovery from concept drift, and temporal dependencies.
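As a rough illustration of concept drift detection, the sketch below monitors the errors of a fixed classifier and flags drift when the recent error rate rises well above the error rate over the whole stream so far. This is a simplified heuristic written for illustration, not one of the published detectors; the class `WindowDriftDetector`, its `window` and `threshold` parameters, and the synthetic labeling rule that flips at instance 200 are all assumptions.

```python
from collections import deque


class WindowDriftDetector:
    """Signals drift when the error rate over the most recent `window`
    predictions exceeds the error rate over the whole stream so far by
    more than `threshold`. Illustrative heuristic only."""

    def __init__(self, window=50, threshold=0.3):
        self.window = window
        self.threshold = threshold
        self.recent = deque(maxlen=window)  # last `window` error flags (0 or 1)
        self.errors = 0                     # total errors seen so far
        self.count = 0                      # total predictions seen so far

    def update(self, error):
        """Record one outcome (1 = wrong, 0 = right); return True on drift."""
        self.recent.append(error)
        self.errors += error
        self.count += 1
        if len(self.recent) < self.window:
            return False  # not enough evidence yet
        recent_rate = sum(self.recent) / self.window
        overall_rate = self.errors / self.count
        return recent_rate - overall_rate > self.threshold


def true_label(i, x):
    """Hypothetical labeling rule that flips at instance 200 (the drift)."""
    return x >= 0.5 if i < 200 else x < 0.5


def static_model(x):
    """A fixed classifier that matches the first concept and is never retrained."""
    return x >= 0.5


detector = WindowDriftDetector()
drift_at = None
for i in range(400):
    x = (i % 10) / 10  # deterministic synthetic feature in [0, 0.9]
    error = int(static_model(x) != true_label(i, x))
    if detector.update(error) and drift_at is None:
        drift_at = i   # first instance at which drift is flagged

print("drift flagged at instance", drift_at)
```

In this synthetic setting the detector raises its flag shortly after the labeling rule changes; a real system would typically respond by retraining or replacing the model.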
Examples of data streams include computer network traffic, phone conversations, ATM transactions, web searches, and sensor data. Data stream mining can be considered a subfield of data mining, machine learning, and knowledge discovery.
Software for data stream mining
- MOA (Massive Online Analysis): free open-source software, developed in Java, specifically for mining data streams with concept drift. It includes several kinds of machine learning algorithms (classification, regression, clustering, outlier detection, and recommender systems). It also contains a prequential evaluation method, the EDDM concept drift detection method, a reader for real datasets in ARFF format, and artificial stream generators such as SEA concepts, STAGGER, rotating hyperplane, random tree, and random radial basis functions. MOA supports bi-directional interaction with Weka (machine learning).
- scikit-multiflow: a machine learning framework for multi-output/multi-label and stream data, implemented in Python. It contains stream generators, single-target and multi-target stream learning methods, concept drift detectors, and evaluation and visualisation methods. The project is discontinued.
- StreamDM: an open-source framework for big data stream mining that uses the Spark Streaming extension of the core Spark API. One advantage of StreamDM over existing frameworks is that it benefits directly from the Spark Streaming API, which handles many of the complex problems of the underlying data sources, such as out-of-order data and recovery from failures.
- RapidMiner: commercial software for knowledge discovery, data mining, and machine learning. When used in combination with its data stream mining plugin (formerly the Concept Drift plugin), it also supports data stream mining, learning time-varying concepts, and tracking drifting concepts.
- RiverML: River is a Python library for online machine learning. It is the result of a merger between creme and scikit-multiflow. Its stated ambition is to be the go-to library for machine learning on streaming data.
- GAENARI: an incremental decision tree implemented in C++. It continuously inserts and updates chunked data sets and supports rebuilding the tree to handle concept drift.
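The prequential evaluation mentioned in the MOA entry follows a test-then-train scheme: each arriving instance is first used to test the current model and only afterwards to update it. The sketch below shows that loop in plain Python rather than with any of the packages listed above; the `MajorityClassLearner` class, the `predict_one`/`learn_one` method names, and the synthetic stream are illustrative assumptions.

```python
from collections import Counter


class MajorityClassLearner:
    """Toy incremental learner: predicts the most frequent label seen so far."""

    def __init__(self):
        self.counts = Counter()

    def predict_one(self, x):
        # No prediction is possible before the first labeled instance arrives.
        return self.counts.most_common(1)[0][0] if self.counts else None

    def learn_one(self, x, y):
        self.counts[y] += 1


def prequential_accuracy(stream, model):
    """Test-then-train: each instance is used for testing first, then training."""
    correct = total = 0
    for x, y in stream:
        correct += int(model.predict_one(x) == y)  # 1. test on the unseen instance
        model.learn_one(x, y)                      # 2. then train on it
        total += 1
    return correct / total if total else 0.0


# Hypothetical stream of (features, label) pairs; this toy learner ignores the features.
stream = [({"t": t}, "spam" if t % 4 == 0 else "ham") for t in range(20)]
accuracy = prequential_accuracy(stream, MajorityClassLearner())
print(f"prequential accuracy: {accuracy:.2f}")
```

Because every instance is scored before the model has seen its label, no separate hold-out set is needed, which suits streams that can be read only once.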
Events
- International Workshop on Ubiquitous Data Mining, held in conjunction with the International Joint Conference on Artificial Intelligence (IJCAI) in Beijing, China, August 3–5, 2013.
- International Workshop on Knowledge Discovery from Ubiquitous Data Streams, held in conjunction with the 18th European Conference on Machine Learning (ECML) and the 11th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD) in Warsaw, Poland, in September 2007.
- ACM Symposium on Applied Computing Data Streams Track held in conjunction with the 2007 ACM Symposium on Applied Computing (SAC-2007) in Seoul, Korea, in March 2007.
- IEEE International Workshop on Mining Evolving and Streaming Data (IWMESD 2006), held in conjunction with the 2006 IEEE International Conference on Data Mining (ICDM-2006) in Hong Kong in December 2006.
- Fourth International Workshop on Knowledge Discovery from Data Streams (IWKDDS), held in conjunction with the 17th European Conference on Machine Learning (ECML) and the 10th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD) (ECML/PKDD-2006) in Berlin, Germany, in September 2006.
See also
- Concept drift
- Data mining
- Sequence mining
- Streaming algorithm
- Stream processing
- Wireless sensor network
- Lambda architecture
Books
- Bifet, Albert; Gavaldà, Ricard; Holmes, Geoff; Pfahringer, Bernhard (2018). Machine Learning for Data Streams with Practical Examples in MOA. Adaptive Computation and Machine Learning. MIT Press. p. 288. ISBN 9780262037792.
- Gama, João; Gaber, Mohamed Medhat, eds. (2007). Learning from Data Streams: Processing Techniques in Sensor Networks. Springer. p. 244. doi:10.1007/3-540-73679-4. ISBN 9783540736783.
- Ganguly, Auroop R.; Gama, João; Omitaomu, Olufemi A.; Gaber, Mohamed M.; Vatsavai, Ranga R., eds. (2008). Knowledge Discovery from Sensor Data. Industrial Innovation. CRC Press. p. 215. ISBN 9781420082326.
- Gama, João (2010). Knowledge Discovery from Data Streams. Data Mining and Knowledge Discovery. Chapman and Hall. p. 255. ISBN 9781439826119.
- Lughofer, Edwin (2011). Evolving Fuzzy Systems - Methodologies, Advanced Concepts and Applications. Studies in Fuzziness and Soft Computing. Vol. 266. Heidelberg: Springer. p. 456. doi:10.1007/978-3-642-18087-3. ISBN 9783642180866.
- Sayed-Mouchaweh, Moamar; Lughofer, Edwin, eds. (2012). Learning in Non-Stationary Environments: Methods and Applications. New York: Springer. p. 440. CiteSeerX 10.1.1.709.437. doi:10.1007/978-1-4419-8020-5. ISBN 9781441980199.