Question: https://codility.com/demo/take-sample-test/distinct
Question Name: Distinct
In this question, the expected worst-case time complexity is O(N*log(N)). Thus the intended, or official, solution is to sort the input and traverse it. This solution is good in practice, and it is shown below.
```python
def solution(A):
    if len(A) == 0:
        distinct = 0
    else:
        distinct = 1
        A.sort()
        for index in xrange(1, len(A)):
            if A[index] == A[index-1]:
                # The same element as the previous one
                continue
            else:
                # A new element
                distinct += 1
    return distinct
```
In theory, there is a solution with worst-case time complexity O(N) and worst-case space complexity O(1). But the constant factors are quite high, and this solution depends on the assumption that "each element of array A is an integer within the range [−1,000,000..1,000,000]." These reasons make it bad in practice, although it gets 100/100 here.
```c
#include <stdlib.h>
#include <string.h>     /* for memset() */

// A better and portable way is to include <limits.h>.
// But I failed to include that header, so I have to hard-code
// CHAR_BIT here. Even if CHAR_BIT is larger than 8, our
// program should still work, except for wasting some space.
// Historically, there have been different standards for CHAR_BIT:
// http://pubs.opengroup.org/onlinepubs/009695399/basedefs/limits.h.html
// http://pubs.opengroup.org/onlinepubs/7908799/xsh/limits.h.html
// But in all cases, CHAR_BIT is at least 8.
#define CHAR_BIT 8

int solution(int A[], int N) {
    unsigned char appeared[2000001/CHAR_BIT+1];
    // onesInByte is a pre-computed array. onesInByte[i] == j means there
    // are j 1s in the binary representation of the integer i.
    unsigned short int onesInByte[] = {
        0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4,
        1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
        1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
        2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
        1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
        2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
        2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
        3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
        1, 2, 2, 3, 2, 3, 3, 4, 2, 3, 3, 4, 3, 4, 4, 5,
        2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
        2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
        3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
        2, 3, 3, 4, 3, 4, 4, 5, 3, 4, 4, 5, 4, 5, 5, 6,
        3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
        3, 4, 4, 5, 4, 5, 5, 6, 4, 5, 5, 6, 5, 6, 6, 7,
        4, 5, 5, 6, 5, 6, 6, 7, 5, 6, 6, 7, 6, 7, 7, 8
    };
    int index = 0;
    int result = 0;
    memset(appeared, 0, 2000001/CHAR_BIT+1);
    // Record the appeared values
    for (index = 0; index < N; ++index) {
        appeared[(A[index]+1000000)/CHAR_BIT] |=
            (1 << (int)((A[index]+1000000)%CHAR_BIT));
    }
    // Compute the number of distinct values
    for (index = 0; index < 2000001/CHAR_BIT+1; ++index) {
        result += onesInByte[appeared[index]];
    }
    return result;
}
```
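For readers who prefer Python, the same bitmap idea can be sketched as follows. This is my own sketch, not the author's code: it assumes the stated range [−1,000,000..1,000,000] and trades the bit-packing for one byte per value to keep it short.

```python
def solution(A):
    # One byte per possible value in [-1000000, 1000000].
    OFFSET = 1000000
    seen = bytearray(2 * OFFSET + 1)  # all slots start at zero
    for value in A:
        seen[value + OFFSET] = 1      # mark the value as appeared
    return sum(seen)                  # number of marked slots = distinct count

print(solution([2, 1, 1, 2, 3, 1]))  # → 3
```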
Remember not to blindly believe in the Big O.
I’m just using set. Not sure that it’s O(N*log(N)) time, but Codility showed 100%.
In practice, it is O(N). In theory, the worst case is O(N^2).
I think for a set it should be N*log(N)? A set generally uses a BST, so it stores the elements in sorted order internally. We don’t care about the ordering, so set wouldn’t be an ideal data structure to use.
If you are aiming for worst case O(N) and average case O(1), you should use unordered_set instead of set. As unordered_set generally uses hashing, the worst case would be O(N), but the average or best case is O(1).
You are right. I should have been more specific. By set I meant a hash-based set, also known as unordered_set.
Thanks a lot for your addition!
len(list(set(A))) will also work….
Yeah! len(set(A)) works well! Great Python!!!
Just arrived at the same solution. return len(set(A))
This runtime would be guaranteed to be N*log(N) in C++, where set is a red-black tree. Python uses hash tables, which gets us into the hashing-function discussion…
Funny to get some knowledge about the low-level implementation!
FYI: here is an excellent article about the choice between red-black tree and hash table:
http://programmers.stackexchange.com/questions/234793/why-does-python-use-hash-table-to-implement-dict-but-not-red-black-tree
We are not supposed to use a high-level API such as set.
By solving this puzzle we learn what the actual code inside set is.
Just my 2 cents.
First of all, thanks for your comment and any discussion is welcome!
IMO, set is OK. We are not going to program everything from scratch. Libraries, especially the built-in ones, are good for us. Even assembly language provides us some macros as a high-level API. As long as we can finish the task with the provided resources, it should be fine.
In addition, I did not use set in my code. A set is typically based on a hash table or a red-black tree. I used sort in the first solution, and a bitmap in the second. A bitmap is similar to an array, but far from a set.
Why so much code? Isn’t that enough?
len({x for x in A})
https://wiki.python.org/moin/TimeComplexity
You cannot guarantee the worst-case time complexity as required if set or dict is used.
And for programmers who are not using Python, this code is a little hard to understand. len(set(A)) is better.
In practice, it is O(N). In theory, the worst case is O(N^2).
Here is a solution in PHP; please add it to the post.
Thanks for sharing!
One more PHP solution – it works in a similar way but with less code.
https://codility.com/demo/results/demoT9VNMS-CFK/
Show time for PHP! I do not know the performance of array_flip() here, but it is really short.
Thanks!
For the exact same solution but written in JavaScript I get 66%: https://codility.com/demo/results/demoMKN37U-GUW/
But in C# it gets 100%. What am I missing?…
Hi,
The .sort comparator function should return a negative number if a is less than b, 0 if they are equal, or a positive number if a is greater than b. In your code, you are returning a boolean, which makes the resulting array kind of… unpredictable, I think.
OMG!!! This is what I hoped for: the readers, not only me, helping each other.
Thanks for your awesome answer!
I got 100% in C#, and it’s really short.
But I am using a built-in function here, so I can say it is an O(N) solution.
https://codility.com/demo/results/demoGVEVE9-SY5/
I do not know C#. As far as I can see, your solution’s complexity depends on the Distinct() and ToList() functions.
Anyway, thanks for providing another way.
Here is another solution, using VB.NET, that gets a 100% score in the Codility environment.
The solution seems good, though the variable name “index” is misleading.
Hello Sheng,
I love your explanations and proofs. I have been looking at your proofs and loving them. I addressed this problem by using a dictionary instead, which I believe has worst-case complexity O(N), as I tried to minimize the risk of collision by checking if the key already exists. Let me know what you think.
In theory, the worst case of operations on a dictionary is O(N), not O(1). Therefore, in the worst case, the complexity of your solution is O(N^2).
The good news is: the dictionary is nearly always O(1) in practice.
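To make the discussion concrete, a dictionary-based solution might look like the following minimal sketch (my own, not the commenter's code): average O(N) overall, O(N^2) in the theoretical worst case, as noted above.

```python
def solution(A):
    appeared = {}
    for value in A:
        # Average O(1) per lookup/insert; O(N) per operation in the
        # worst case, when many hash collisions occur.
        if value not in appeared:
            appeared[value] = True
    return len(appeared)

print(solution([2, 1, 1, 2, 3, 1]))  # → 3
```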
Adding a Java solution: O(N) in practice, O(N^2) in theory.
I came up with a different idea after thinking about the hash collisions. With the boundaries of the problem in mind, I made the assumption of [−1,000,000 to 1,000,000]:
it scores 100%
https://codility.com/demo/results/demoMTWUSD-S9M/
Codility reported O(N) or possibly O(N*log(N)); I think it’s O(N). What do you think?
(Btw, Programming Pearls inspired this solution.)
EDIT by admin: this solution has a bug. See following comment.
After having some concerns about the wasted space of my previous solutions, I decided to try something different, and I found https://en.wikipedia.org/wiki/Hamming_weight#Language_support
So using BitSet’s cardinality() method I could use less space and time to compute the answer.
100
https://codility.com/demo/results/demoN57YUC-G9Z/
I also used one BitSet, since I read that BitSet works better for large arrays.
EDIT by admin: this solution has a bug. See following comment.
It is still problematic. For the value and index in your solution:
BUG~
-1000000 and 0 have the same index in the bitset.
It is O(N). But there IS a bug in your solution: inputLimit should be 1000001 (consider 0 and 1000000).
You’re right, I can’t edit my answer. Great catch!
I appended some warning information. Thanks and enjoy!
Hi All,
Thanks for your valuable comments.
I implemented the following O(N) (100/100) solution in https://codility.com/demo/results/demoM8HJHM-FR5/, and an O(N log N) solution in https://codility.com/demo/results/demoDZS6YY-QK5/ that uses the array-sorting strategy.
The performance analysis in Codility shows how the O(N) solution is about 25 ms faster than the O(N log N) in the large tests.
By the way, how can the source code be formatted when posting a comment?
Cheers,
Ernesto
Hi, the following is just another O(N Log N) 100/100 variation based on the idea of sorting the array. https://codility.com/demo/results/demo3ESX6Y-AX7/.
I prefer the O(N) solution posted previously even if the code is not as self-documented as this one.
Cheers,
Ernesto
Hi,
I prefer to use built-in functions. 🙂
Just one thing you need to add, as per the question:
JDK 8 for help:
https://codility.com/demo/results/demoQGMNC3-Q8X/
JavaScript solution, 100%.
Another JavaScript solution, also 100%.
In Python:
Maybe newer Python… not sure when Counter was added. Very fast.
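For reference, collections.Counter was added to the standard library in Python 2.7. A Counter-based version (my sketch, under that assumption) might look like:

```python
from collections import Counter

def solution(A):
    # Counter builds a value -> frequency map in one pass;
    # the distinct count is simply the number of keys.
    return len(Counter(A))

print(solution([2, 1, 1, 2, 3, 1]))  # → 3
```

Building the full frequency map is slightly more work than a plain set, but both are a single O(N) pass on average.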
When I read the instructions I thought … it has to be harder than this … especially after that MinAvgTwoSlice nightmare.
It does come back as O(N * Log(N)) worst case, O(N) expected. So maybe not fully optimized.
I thought outside the box about how to count the distinct values using a dictionary. The time complexity of my solution was O(N*log(N)) or O(N).
In C++:
Well this is my solution.
My Python code using a Dictionary:
Use a dictionary or just use a set.