Question Regarding Bovine Genomics US Open 2017 Silver

Stuck on the problem Bovine Genomics under the Complete Search section of the guide.

My understanding of the problem is as follows: if any three distinct positions in spotty[n] - (s_i,s_j,s_k) - equal the same positions in plain[n] - (p_i,p_j,p_k) - then that set will be not be considered distinct.

I came up with the following approach to check each position for all possible three pairs:

string spotty[n], plain[n];
//after taking input
int ans = 0;
for(int i=0; i<m; i++) {
    for(int j=i+1; j<m; j++) {
        for(int k=j+1; k<m; k++) {
            bool check = false;
            for(int a=0; a<n; a++) {
                if(spotty[a][i] != plain[a][i] || 
                   spotty[a][j] != plain[a][j] || 
                   spotty[a][k] != plain[a][k]) check = true;
            }
            if(check) ans++;
        }
    }
}
cout << ans << endl;

For the sample test case, this code block returns ans as 36 (while the correct value should be 22). The Official Analysis states that we need to check for if we have already visited any similar pairs. I am not able to wrap my head around this part, don’t we just need to check that for any three distinct positions the sequence of characters in the spotty genome doesn’t match with the characters in the plain genome.

How exactly is the official analysis code distinguish that a set has already been visited? I understand that converting each character value to base 4 and then comparing their collective values will confirm if a set is distinct or not, but how is this helping it keep track of sets they have already considered (as stated in the analysis)?

Two logical mistakes: Your code is comparing only spotty cow number a with plain cow number a. Also you’ve inverted the logic, check should start with true and become false if any of the triplets matches.

There is no special relationship between spotty a and plain a. Instead you have to check all the plain cows to see if they match a spotty cow. This can be done in O(n^2) by checking all pairs, but O(n^2m^3) is too slow for this problem.

Convert the genomes to an integer in [0, 64) with base 4. Now the problem becomes: check if two arrays of integers intersect. This can solved efficiently by building a bool vis[64] array out of one array, then traversing the other array. This is O(n), so overall we have O(nm^3), around 50 choose 3 * 500 = 10 mill operations.

More generally if the numbers are much bigger than 64 you may want to use a set<int> data structure instead of a bool array.