Months ago, I coded a tiny crawler to monitor some electronics deals.
I extracted a list of my target products from the homepage and, on each product page, tried to add the product to my cart. Basically, the requirements were:
- Check a product page only if the product is marked as available on the homepage;
- Leave at least 1 hour between two attempts on the same product page. That is, in the hour following an attempt, do nothing. After one hour, if the product is still marked as available, try again. Otherwise, do nothing;
- The number of unique product pages is small, fewer than 100.
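To achieve that, I made a simple expiring set in Python. It does not proactively clean up expired entries (no garbage collection); an entry is only removed when it is looked up after it has expired. That is fine here, because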
- The same items are added from time to time;
- The number of unique items is small.
```python
from time import time


class ExpiringSet():
    def __init__(self, max_age_seconds):
        assert max_age_seconds > 0
        self.age = max_age_seconds
        # Maps each value to the timestamp at which it was last added.
        self.container = {}

    def contains(self, value):
        if value not in self.container:
            return False
        # Expired entries are dropped lazily, on lookup.
        if time() - self.container[value] > self.age:
            del self.container[value]
            return False
        return True

    def add(self, value):
        # Re-adding a value resets its timestamp.
        self.container[value] = time()
```
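To see the one-hour rule in action without actually waiting an hour, here is a minimal sketch. It restates the class compactly and adds an optional `clock` parameter so the time source can be swapped out. The `clock` parameter, the `FakeClock` class, the `should_attempt` helper and the example URL are all hypothetical names added for illustration; they are not part of the original crawler.

```python
from time import time


class ExpiringSet:
    """Compact restatement of the expiring set, with an injectable clock."""

    def __init__(self, max_age_seconds, clock=time):
        assert max_age_seconds > 0
        self.age = max_age_seconds
        # Assumption: `clock` is any zero-argument callable returning seconds.
        self.clock = clock
        self.container = {}

    def contains(self, value):
        added_at = self.container.get(value)
        if added_at is None:
            return False
        if self.clock() - added_at > self.age:
            del self.container[value]  # expired: drop lazily on lookup
            return False
        return True

    def add(self, value):
        self.container[value] = self.clock()


def should_attempt(recent, url):
    """Hypothetical helper: True if `url` has no unexpired entry in `recent`."""
    if recent.contains(url):
        return False
    recent.add(url)  # record this attempt so the next hour is skipped
    return True


class FakeClock:
    """Hypothetical stand-in for time(), advanced by hand."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now


clock = FakeClock()
recent = ExpiringSet(3600, clock=clock)
url = "https://example.com/product/1"  # placeholder URL

first = should_attempt(recent, url)   # first sighting: attempt
clock.now += 1800
second = should_attempt(recent, url)  # 30 minutes later: skip
clock.now += 1801
third = should_attempt(recent, url)   # just over an hour after the first attempt: retry
```

In the real crawler the default `clock=time` would be used, and `should_attempt` would only be called for products currently marked as available on the homepage.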
It’s an interesting tip for the time constraint problem. Nice to make it a class.
Thanks for sharing.
Daniel