Zamjene Debugging

Problem Info

I’m doing the problem Zamjene.

My Work

#include <iostream>
#include <vector>
#include <map>
#include <algorithm>

using std::cout;
using std::endl;
using std::vector;

class DominikArray {
    private:
        const int POW = 1000003;
        const int MOD = 1e9 + 9;

        vector<int> arr;
        vector<int> sorted;
        
        vector<int> parent;
        vector<int> size;
        int bad_num = 0;

        std::map<int, long long> elem_pow;  // raise to the power of the value
        // the current hash of a component
        vector<long long> hash;
        // the needed hash for a component to be able to be sorted
        vector<long long> req_hash;
        std::map<long long, int> bad_diff;  // the hash differences of the bad components
        long long cloud_pairs = 0;

        int get_top(int n) {
            return parent[n] == n ? n : (parent[n] = get_top(parent[n]));
        }

        inline bool is_unsortable(int n) {
            return hash[n] != req_hash[n];
        }

        /** if a component is bad, add it to the bad log */
        void add_if_bad(int n) {
            if (is_unsortable(n)) {
                bad_num++;
                long long diff = req_hash[n] - hash[n];
                bad_diff[diff] += size[n];

                long long pair_amt = bad_diff[-diff] + bad_diff[MOD - diff] + bad_diff[-MOD - diff];
                cloud_pairs += pair_amt * size[n];
            }
        }

        void remove_if_bad(int n) {
            if (is_unsortable(n)) {
                bad_num--;
                long long diff = req_hash[n] - hash[n];
                bad_diff[diff] -= size[n];

                long long pair_amt = bad_diff[-diff] + bad_diff[MOD - diff] + bad_diff[-MOD - diff];
                cloud_pairs -= pair_amt * size[n];
            }
        }
    public:
        DominikArray(vector<int> arr)
            : arr(arr), parent(arr.size()), size(arr.size(), 1),
              hash(arr.size()), req_hash(arr.size()) {
            sorted = arr;
            std::sort(sorted.begin(), sorted.end());

            // perform coordinate compression
            long long curr_pow = 1;
            for (int i : sorted) {
                if (!elem_pow.count(i)) {
                    elem_pow[i] = curr_pow;
                    curr_pow = (curr_pow * POW) % MOD;
                }
            }
            
            // set up DSU and the hashes
            for (int i = 0; i < arr.size(); i++) {
                parent[i] = i;
                hash[i] = elem_pow[arr[i]];
                req_hash[i] = elem_pow[sorted[i]];
                add_if_bad(i);
            }
        }

        void swap(int a, int b) {
            int top_a = get_top(a);
            int top_b = get_top(b);

            // temporarily take them out of the bad log (if applicable)
            remove_if_bad(top_a);
            remove_if_bad(top_b);

            // change the hashes of the two components
            hash[top_a] = hash[top_a] + elem_pow[arr[b]] - elem_pow[arr[a]] + MOD;
            hash[top_a] = hash[top_a] % MOD;
            hash[top_b] = hash[top_b] + elem_pow[arr[a]] - elem_pow[arr[b]] + MOD;
            hash[top_b] = hash[top_b] % MOD;

            // add the back (if applicable)
            add_if_bad(top_a);
            add_if_bad(top_b);

            std::swap(arr[a], arr[b]);
        }

        void link(int a, int b) {
            a = get_top(a);
            b = get_top(b);
            if (a == b) {
                return;
            }

            if (size[a] < size[b]) {
                return link(b, a);
            }
            
            remove_if_bad(a);
            remove_if_bad(b);
            
            size[a] += size[b];
            parent[b] = a;
            
            // just add the hash of the smaller component to the bigger one
            hash[a] = (hash[a] + hash[b]) % MOD;
            req_hash[a] = (req_hash[a] + req_hash[b]) % MOD;

            add_if_bad(a);
        }

        bool sortable() {
            // for everything to be sortable, there can't be any bad components
            return bad_num == 0;
        }

        long long needed_pair_num() {
            return cloud_pairs;
        }
};

// https://oj.uz/problem/view/COCI16_zamjene (input omitted due to length)
int main() {
    std::ios::sync_with_stdio(false);
    std::cin.tie(NULL);

    int arr_len;
    int query_num;
    std::cin >> arr_len >> query_num;
    vector<int> arr(arr_len);
    for (int& i : arr) {
        std::cin >> i;
    }
    
    DominikArray array(arr);
    for (int q = 0; q < query_num; q++) {
        int type;
        std::cin >> type;
        int a, b;  // not necessarily used (queries of type 3 & 4)
        switch (type) {
            case 1:
                std::cin >> a >> b;
                array.swap(--a, --b);
                break;
            case 2:
                std::cin >> a >> b;
                array.link(--a, --b);
                break;
            case 3:
                cout << (array.sortable() ? "DA" : "NE") << '\n';
                break;
            case 4:
                cout << array.needed_pair_num() << '\n';
                break;
        };
    }
}

So I took a quick peek at the official solution- I realize that the key is to be able to hash a multiset of integers. The hash is P\cdot x_1+P^2\cdot x_2 \cdots, where x_i is the number of occurrences of the i-th number in the set. Then, you can implement queries with a DSU and a map tracking the number of “bad”, or unsortable components, to answer the output queries.

What I’ve Tried

I’ve tried running a brute force script, changing everything to long longs for overflow, standard stuff.
Here’s my generator script:

from random import randint

SIZE = 20
QUERY_NUM = 30
MAX_NUM = 10 ** 6

arr = [randint(1, MAX_NUM) for _ in range(SIZE)]
print(SIZE, QUERY_NUM)
print(*arr)
for _ in range(QUERY_NUM):
    q_type = randint(1, 4)
    if q_type in [3, 4]:
        print(q_type)
    else:
        a = randint(1, SIZE)
        b = randint(1, SIZE)
        while b == a:
            b = randint(1, SIZE)
        print(q_type, a, b)

However, this only gets 84/140 on the actual grader. I’ve been scratching my head for weeks over this problem… any help?

Probably you need double hashing? (use two different powers or two different moduli). With just one small modulus the chance of an accidental collision is quite large.

(This should show up if you generate test cases closer to the maximum size.)

The thing is, the actual solution just uses one modulus and that one got accepted.

The official solution defines MOD but doesn’t use it.

Technically signed integer overflow is not defined. But in practice, I guess it wraps around modulo 2^{64}, which is why the solution works.

(Though polynomial hashing mod 2^{64} isn’t safe (Anti-hash test. - Codeforces) so I assume the above submission is hackable.)

So I should just use double hashing? Are you sure it’s due to a collision?

Yes. Why don’t you try estimating the probability of a collision?

Alright- so I take it the official solution gets AC by sheer luck?

No, mod 2^{64} works well on random data.

Well, removing the modulus in my solution made it get accepted. Should I rely on this in future contests? I don’t really do Codeforces, so hacking isn’t particularly a concern of mine.

If you don’t think there are any anti-hash test cases then it seems safe to do so.

Btw it’s worth noting that you don’t need polynomial hashing for this problem. Instead, you can simply associate each number in [1,10^6] with a random 64-bit integer.