On-the-fly data clustering for the PostgreSQL database management system

Journal: Software & Systems (Vol.36, No. 2)

Publication Date: 2023-06-16

Authors : Tatarnikova T.M.;

Page : 196-201

Keywords : PostgreSQL; centroid method; dynamic link library; DBMS; clustering;

Source : Download Find it from : Google Scholar

Abstract

The paper determines the relevance of the task of real-time data clustering in the form of a dynamically embedded library for the PostgreSQL open-source database management system. There are formulated conditions for performing real-time clustering, which consist in ensuring sufficient performance, in which the time for determining clusters does not exceed the time for writing data to the table and a limited amount of data for clustering. PostgreSQL methods are available in the devel-library, which allows them to be used to interact with data at the internal representation level and other programming languages that perform some operations faster than the SQL query language. The scheme of interaction between elements for clustering includes a database with a dynamically embedded library and the TimescaleDB extension to organize data storage by the database server; an interpreter – a software layer for translating data from the internal representation into the types of the language used before clustering, and vice versa, translating the clustering results into an internal format for saving them to the database; a clusterizer – a program that performs clustering of transmitted data according to an algorithm. The proposed library is an implementation of a trigger function, which in fact is an interpreter that connects the clusterizer with the database. If this is the first function operation for the table, then the initial centroids are selected in the way that the user specified. Otherwise, the centroid data is read from the table. There is a demonstration of the library work. The data set for clustering is randomly generated with a concentration around the given centroid coordinates. The library does not limit the user both in the dimension of points that need to be distributed among clusters, and in the number of tables for inserting data. Due to the computational complexity of the algorithms, there is a limit on the maximum amount of data for clustering.

Main Menu

Searching By

PARTNERS

On-the-fly data clustering for the PostgreSQL database management system

Abstract

Advertisement