Repeated DNA Sequences

All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.

Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.

https://leetcode.com/problems/repeated-dna-sequences/

Discussion

用个hash set来保存每个长为10的substring是否出现过,当再次出现的时候就保存到结果里,这样的话结果里可能有重复的substring,所以先临时存到一个hash set里再保存到result里。也可以用hash matp 统计次数。

Solution

class Solution {
public:
    vector<string> findRepeatedDnaSequences(string s) {
        vector<string> result;
        if(s.length() < 10) return result;
        unordered_set<string> substring_sets;
        unordered_set<string> repeated;
        for(int i=0; i<=s.length()-10; i++) {
            string sub_str = s.substr(i, 10);
            if(substring_sets.find(sub_str) != substring_sets.end()) {
                repeated.emplace(sub_str);
            } else {
                substring_sets.emplace(sub_str);
            }
        }
        for(auto const& str : repeated) {
            result.emplace_back(str);
        }
        return result;
    }
};

Follow up

真的这么简单吗?

上面用hash set保存string,浪费了大量的空间,可以考虑用整数存储,即二进制方法。这个思路非常简单,这里一共有四个字母:A,C,G,T。我们转换整数的思路如下:
A = 00,C = 01,G = 10,T = 11。
int key = 0, key = key << 2 | code(A|C|G|T)。 //左移两位再跟上ACGT对应的code
这样我们就很容易把一个字符串转换为整数了。 然后hash set里面保存整数就可以了。

因为只是10为的substring,占20位,没有超过int的32位,所以是可行的。

class Solution {
public:
    vector<string> findRepeatedDnaSequences(string s) {
        vector<string> result;
        if(s.length() <= 10) return result;
        unordered_map<int, int> counter; //count the repeatation of substrings
        for(int i=0; i<=s.length()-10; i++) {
            string sub = s.substr(i, 10);
            int key = GetKey(sub);
            if(counter[key]++ == 1) {
                result.emplace_back(sub);
            }
        }
        return result;
    }
private: 
    //encode the string with binary code
    int GetKey(string &s) {
        int key = 0;
        int code = 0;
        for(auto const& c : s) {
            switch(c) {
            case 'A':
                //00
                code = 0;
                break;
            case 'C':
                //01
                code = 1;
                break;
            case 'G':
                //10
                code = 2;
                break;
            case 'T':
                //11
                code = 3;
                break;
            default:
                code = 0;
            }
            key = key << 2 | code;
        }
        return key;
    }
};

更简洁的写法:

class Solution {
public:
    vector<string> findRepeatedDnaSequences(string s) {
        vector<string> result;
        if(s.length() <= 10) return result;
        unordered_map<int, int> counter; //count the repeatation of substrings
        //code for A, C, G, T
        int code[26] = {0};
        code[0] = 0; // 'A' -- '00'
        code[2] = 1; // 'C' --01  C-A==2
        code['G'-'A'] = 2;
        code['T'-'A'] = 3;
        for(int i=0; i<=s.length()-10; i++) {
            //string sub = s.substr(i, 10);
            //int key = GetKey(sub);
            //get key for a 10-letter substring, don't use sutstr function here.
            int key = 0;
            for(int j=i; j<i+10; j++) {
                key = key << 2 | code[s[j]-'A'];
            }
            if(counter[key]++ == 1) {
                result.emplace_back(s.substr(i, 10));//construct sub string here
            }
        }
        return result;
    }
};

Solution - vis Trie

用Trie也能节省空间并且方便统计次数,但是还是谁memory limit。
但如果是动态的input string的话用Trie明显更好。

class DnaSubSeq {
public:
    DnaSubSeq() {
        root = new Node();
        root->count = 0;
    }
    //return frequency of s
    int addSequ(string s) {
        Node *cur = root;
        for(auto const& c : s) {
            if(cur->neighbors.find(c) == cur->neighbors.end()) {
                cur->neighbors[c] = new Node();
            }
            cur = cur->neighbors[c];
        }
        cur->count++;
        return cur->count;
    }
private:
    struct Node {
        int count = 0;
        unordered_map<char, Node*> neighbors;
        Node() {};
    };
    Node *root;
};


class Solution {
public:
    vector<string> findRepeatedDnaSequences(string s) {
        vector<string> result;
        if(s.length() <=10) return result;
        DnaSubSeq dna_sub;
        for(int i=0; i<=s.length()-10; i++) {
            string sub = s.substr(i, 10);
            if(dna_sub.addSequ(sub) == 2) {
                result.emplace_back(sub);
            }
        }
        return result;
    }
};

Reference

  1. http://blog.csdn.net/wzy_1988/article/details/44224749
  2. https://leetcode.com/discuss/24478/i-did-it-in-10-lines-of-c
  3. https://leetcode.com/discuss/44689/10-lines-c-code-8-ms-passed

results matching ""

    No results matching ""