Consider the scenario where we are receiving a stream of strings and
want to store only unique strings. One approach of achieving this
in python would be as below:
# Assuming values() returns a constant stream of strings
unique_vals = []
for val in self.values():
if val not in unique_vals:
unique_vals.append(val)
However in python strings are immutable and are hashable. Assuming we
are not interested in ordering wouldn’t it be better if we use a set
instead of list as the container to gather unique strings. So the
above code snippet can be written as:
unique_vals = set()
for i in self.values():
unique_vals.add(val)
But which of this is faster ? I think the version using the set
would be faster since membership tests would make use of hash values
of the set elements. One way that this can be verified is by running
some benchmark tests using python’s timeit module.
Here are some of the tests I ran and the results I got.
>>> from timeit import timeit >>> timeit(stmt='x = range(100); 99 in x') 4.287418931735737 >>> timeit(stmt='x = set(range(100)); 99 in x') 7.882405166019703
In the above example I create a list of numbers till 99 and check if
the number 99 exists in the container (list and set). I choose 99
because the membership test in a list would be a linear search and
choosing a value of 99 ensures that we are taking into account the
worst case scenario for benchmarking. Surprisingly the version
involving set takes more time (close to 2x) than the one that involves
the list. But a closer look reveals a problem in the benchmarking
code, the benchmark tests not only compare the lookups but also the
time it takes initialize the sequence (set or list).
So I run the benchmark test ignoring the time involved for sequence
initialization.
>>> timeit(setup='x=range(100)',stmt='99 in x') 2.739381960556443 >>> timeit(setup='x=set(range(100))',stmt='99 in x') 0.08809443658344662
As the results above indicate that snippet that uses the set
outperforms the one that uses the list.
So moving further and running benchmarks that is closer to the problem
I started with, here are the results I get:
>>> timeit(setup='x=[]',stmt="""for i in range(10)+range(15): if i not in x: x.append(i)""") 7.740467014662272 >>> timeit(setup='x=set()',stmt="""for i in range(10)+range(15): x.add(i)""") 5.565552325784438
As the above results indicate, to find unique elements in a sequence of
hashables it is efficient to use a set instead of a list (assuming we are not
interested in preserving the ordering).
