Abstract:
Python is a programming language that has experienced tremendous growth in its developer
community over recent years. Several reasons can be attributed to this including the ease of
use and readability. Data processing is one crucial area where python is popular. It is the
second most used language for data analysis. Currently, many computer science applications
are becoming more and more data-oriented, especially with the advent of machine learning.
Therefore, performance improvement in the data processing field should be looked upon
keenly. Multithreading is one crucial way to performance improvement in the data processing.
Hence, many languages, including python, support multithreading. However, the
multithreading module of CPython, the reference implementation of python, suffers
controversy due to GIL (Global interpreter lock) which prevents two threads from entering
the same section of code simultaneously. While this is advantageous for I/O bound programs,
it is considered as the performance hit for CPU bound programs. In this research, the
performance of the parallel version of the Aho-Corasick algorithm is compared against its
serial version. Aho-Corasick is a widely used algorithm which addresses the problem of exact
string matching, which is one of a significant problem in the computer science domain. The
parallelising technique used is the division of input text among threads. Here, we have
implemented the serial and parallel versions of AhoCorasick in python and found the realtime elapsed for the parallel portion of the algorithm. The input text size is varied, and graphs
are drawn to interpret the result. Also, we have used benchmark implementations of parallel
and serial versions of the algorithm used for python. The results show that the time
performance of parallel Aho-Corasick algorithm is lower than the serial due to GIL (Global
interpreter lock), while the time performance of the benchmark implementation of the
parallel version is higher than its serial version.