Repeated DNA Sequences
All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.
Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.
https://leetcode.com/problems/repeated-dna-sequences/
Discussion
用个hash set来保存每个长为10的substring是否出现过,当再次出现的时候就保存到结果里,这样的话结果里可能有重复的substring,所以先临时存到一个hash set里再保存到result里。也可以用hash matp 统计次数。
Solution
class Solution {
public:
vector<string> findRepeatedDnaSequences(string s) {
vector<string> result;
if(s.length() < 10) return result;
unordered_set<string> substring_sets;
unordered_set<string> repeated;
for(int i=0; i<=s.length()-10; i++) {
string sub_str = s.substr(i, 10);
if(substring_sets.find(sub_str) != substring_sets.end()) {
repeated.emplace(sub_str);
} else {
substring_sets.emplace(sub_str);
}
}
for(auto const& str : repeated) {
result.emplace_back(str);
}
return result;
}
};
Follow up
真的这么简单吗?
上面用hash set保存string,浪费了大量的空间,可以考虑用整数存储,即二进制方法。这个思路非常简单,这里一共有四个字母:A,C,G,T。我们转换整数的思路如下:
A = 00,C = 01,G = 10,T = 11。
int key = 0, key = key << 2 | code(A|C|G|T)。 //左移两位再跟上ACGT对应的code
这样我们就很容易把一个字符串转换为整数了。 然后hash set里面保存整数就可以了。
因为只是10为的substring,占20位,没有超过int的32位,所以是可行的。
class Solution {
public:
vector<string> findRepeatedDnaSequences(string s) {
vector<string> result;
if(s.length() <= 10) return result;
unordered_map<int, int> counter; //count the repeatation of substrings
for(int i=0; i<=s.length()-10; i++) {
string sub = s.substr(i, 10);
int key = GetKey(sub);
if(counter[key]++ == 1) {
result.emplace_back(sub);
}
}
return result;
}
private:
//encode the string with binary code
int GetKey(string &s) {
int key = 0;
int code = 0;
for(auto const& c : s) {
switch(c) {
case 'A':
//00
code = 0;
break;
case 'C':
//01
code = 1;
break;
case 'G':
//10
code = 2;
break;
case 'T':
//11
code = 3;
break;
default:
code = 0;
}
key = key << 2 | code;
}
return key;
}
};
更简洁的写法:
class Solution {
public:
vector<string> findRepeatedDnaSequences(string s) {
vector<string> result;
if(s.length() <= 10) return result;
unordered_map<int, int> counter; //count the repeatation of substrings
//code for A, C, G, T
int code[26] = {0};
code[0] = 0; // 'A' -- '00'
code[2] = 1; // 'C' --01 C-A==2
code['G'-'A'] = 2;
code['T'-'A'] = 3;
for(int i=0; i<=s.length()-10; i++) {
//string sub = s.substr(i, 10);
//int key = GetKey(sub);
//get key for a 10-letter substring, don't use sutstr function here.
int key = 0;
for(int j=i; j<i+10; j++) {
key = key << 2 | code[s[j]-'A'];
}
if(counter[key]++ == 1) {
result.emplace_back(s.substr(i, 10));//construct sub string here
}
}
return result;
}
};
Solution - vis Trie
用Trie也能节省空间并且方便统计次数,但是还是谁memory limit。
但如果是动态的input string的话用Trie明显更好。
class DnaSubSeq {
public:
DnaSubSeq() {
root = new Node();
root->count = 0;
}
//return frequency of s
int addSequ(string s) {
Node *cur = root;
for(auto const& c : s) {
if(cur->neighbors.find(c) == cur->neighbors.end()) {
cur->neighbors[c] = new Node();
}
cur = cur->neighbors[c];
}
cur->count++;
return cur->count;
}
private:
struct Node {
int count = 0;
unordered_map<char, Node*> neighbors;
Node() {};
};
Node *root;
};
class Solution {
public:
vector<string> findRepeatedDnaSequences(string s) {
vector<string> result;
if(s.length() <=10) return result;
DnaSubSeq dna_sub;
for(int i=0; i<=s.length()-10; i++) {
string sub = s.substr(i, 10);
if(dna_sub.addSequ(sub) == 2) {
result.emplace_back(sub);
}
}
return result;
}
};