Algorithms for Massive Data frequency dataset reservoir summary misra-Gries summery countmin Sketch

Anonymous
Best to complete in 1 day, and python. 1. Execute bruteforce computation (10 pts): Compute the frequency vector of
all words, descendingly sort the words by their frequencies, and save the result
into le (since you might use the bruteforce result for the next tasks). You need
to report:
(a) The average frequency of the words in stream (5 pts).
(b) Plot the sorted frequency of words to observe the skewed distribu-
tion (5 pts).
2. Reservoir Summary (10 pts): Implement Reservoir Sampling to see the skewed
distribution of our frequency vector. Fix the summary size S = 10; 000, you need
to:
(a) Estimate the frequency vector from our Reservoir Summary, and
plot this estimate vector to see the approximation skewness (5 pts).
(b) Run your Reservoir Sampling 5 times and report the average num-
ber of times the summary has been updated over these 5 runs (5
pts).
3. Misra-Gries Summary (15 pts): Implement Misra-Gries summary to nd the
most frequent words whose frequency is larger than 1,000. You need to:
(a) Explain the size of summary you choose such that you can nd
these frequent words (5 pts).
(b) Run your Misra-Gries summary and report the number of decre-
ment steps with your chosen parameter (10 pts).
4. CountMin Sketch (15 pts): Implement CountMin sketch to estimate the fre-
quency of words. You need to:
(a) Explain the size of summary you choose such that the estimate
error is at most 100 (5 pts).
An answer.pdf le reports the requested values and explanation of each task.
A source code le contains detailed comments.
(b) Run your CountMin Sketch with your chosen parameters, and re-
port the estimate of the frequency of the words, whose frequency
is larger than 1,000 found in the bruteforce algorithm (10 pts).

admin

Author Since: November 30, 2020