Additional space complexity isn’t always a useful metric

I’ve been subscribed to Interview Cake for years, and today they had a really interesting question: Given a list of n + 1 integers in the range 1...n, find one of the duplicates (there is guaranteed to be at least one) in O(n) time and O(1) additional space. The answer is really interesting[1], and I recommend trying it, but I don’t think it makes sense to care about additional space rather than total space, and I still think using a set to keep track of numbers (the most obvious solution) is the best solution in practice.
The reason I thought this was worth writing about is that the “O(1)” linked list algorithm can’t be streamed, so it doesn’t make any sense to treat the memory the input uses and the additional memory as fundamentally different (they both need to be allocated on the same machine). Since the algorithm requires that you hold the input in memory, the only sane way to treat the space complexity is to consider the total, i.e. the linked list solution is O(n) for the input + O(1) for the rest of the algorithm, which is still O(n).
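For concreteness, here’s a minimal sketch of the linked-list-style algorithm (this is the standard “tortoise and hare” cycle-finding approach; I’m assuming it’s essentially what Interview Cake has in mind, and the names here are my own). Notice the int_list[int_list[fast]] indexing: the algorithm hops around the input at arbitrary positions, which is exactly why it can’t be streamed:

def find_duplicate_linked_list(int_list: list[int]) -> int:
    # Treat index i as a node whose "next" pointer is the value int_list[i].
    # Values in 1...n are all valid indices into a list of n + 1 items, and a
    # duplicate value means two nodes share a successor, i.e. there's a cycle.
    # Phase 1: advance a slow and a fast pointer until they meet inside the cycle.
    slow = fast = 0
    while True:
        slow = int_list[slow]
        fast = int_list[int_list[fast]]
        if slow == fast:
            break
    # Phase 2: the cycle's entry point is the duplicate value; restart one
    # pointer from the head and advance both one step at a time until they meet.
    slow = 0
    while slow != fast:
        slow = int_list[slow]
        fast = int_list[fast]
    return slow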
In this specific case, the constant factors are also huge. The numbers are guaranteed to be densely packed and in the range 1...n, so our set can be a bit array of length n (i.e. to mark “2” as being in our set, we set bit 2 of our array). This means that our additional space is actually going to be smaller than our input by a factor of the size of our integer type (i.e. with 32-bit integers, the input will be 32x larger than our bit array). With any reasonable integer size, the size of the input is going to be massively larger than the additional memory for our set.
In a previous post on a similar question, I pointed out that a hash table with a million 32-bit entries will only use 8 MB of memory, but this case is even better. If you want to find the duplicate in a list of a million 32-bit integers, your list will take 4 MB of memory but the set of numbers you’ve seen will only take 125 KB of memory.
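As a quick back-of-the-envelope check on those numbers (assuming 4 bytes per 32-bit integer and one bit per possible value):

n = 1_000_000
input_bytes = 4 * n    # a million 32-bit integers: 4 MB
bitset_bytes = n // 8  # one bit for each value in 1...n: 125 KB
print(input_bytes // bitset_bytes)  # 32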
Finally, remember how I said that the linked list solution can’t be streamed? Well, the set solution can, so taking that and the bit array optimization into account, the set solution can use substantially less memory in practice because it only has to store the extremely efficient bit set and doesn’t have to hold the entire input in memory!
For reference, here’s a Python solution that uses substantially less memory than the linked-list solution when streaming the input:
from typing import Any, Iterable
# python3 -m pip install bitarray
from bitarray import bitarray
# We need a resizable bit set for streaming since we don't know the largest value we'll see
# If we do know the largest value, we can replace this entire class with bitarray(max_value)
class BitSet:
"""
A set wrapper around bitarray
    Note that the set can only contain integers, and this set only makes sense if the
    integers it contains are in a predictable, densely packed range (specifically,
    when num_values / ((max_value - min_value) / 64) > 1, i.e. when the bit array is
    smaller than storing each value as a 64-bit integer).
"""
def __init__(self, min_value: int = 0, initial_size: int = 256) -> None:
"""
Initialize a bit set
min_value is the smallest value that can be contained in the set (i.e. it will
be mapped to index 0)
"""
self.min_value = min_value
self._set = bitarray(initial_size)
self._set.setall(0)
def __contains__(self, value: Any) -> bool:
return (
isinstance(value, int)
and 0 <= value - self.min_value < len(self._set)
and self._set[value - self.min_value] == 1
)
def ensure_size(self, min_length: int) -> None:
"""
        Ensure the underlying bit array is at least as large as the given min_length
        (resizing if necessary).
"""
starting_size = len(self._set)
if min_length > starting_size:
# Don't do a bunch of tiny resizes
min_length = max(min_length, starting_size * 2)
# Create a new array, initialize it, and copy the old data if applicable
old_set = self._set
self._set = bitarray(min_length)
self._set[starting_size:] = 0
if old_set:
self._set[:starting_size] = old_set
def add(self, value: int) -> None:
assert value >= self.min_value
mapped_value = value - self.min_value
self.ensure_size(mapped_value + 1)
self._set[mapped_value] = 1
def find_first_duplicate(int_list: Iterable[int]) -> int:
"""
Find one of the duplicates in a list of n + 1 integers in the range 1...n
"""
seen = BitSet(min_value=1)
for value in int_list:
if value in seen:
return value
seen.add(value)
assert False, "No duplicates found, which is impossible with conforming inputs"
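And a quick usage sketch, continuing from the code above (random_stream is a made-up stand-in for whatever your real input source is): because find_first_duplicate reads one value at a time and only keeps the bit set around, a generator input is never materialized in memory:

import random

def random_stream(n: int) -> Iterable[int]:
    # n + 1 values in the range 1...n: the pigeonhole principle guarantees
    # at least one duplicate, and we never hold the full input in memory
    for _ in range(n + 1):
        yield random.randint(1, n)

print(find_first_duplicate([1, 3, 5, 4, 2, 4]))  # 4
print(find_first_duplicate(random_stream(1_000_000)))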
[1] If you found the Interview Cake question interesting, you might also like this Veritasium video explaining a related riddle.