<> Problem description

Huffman coding is a very effective coding method widely used in data file compression . The compression ratio is usually 20%～90% between . Huffman coding algorithm uses the frequency table of characters in the file to establish a 0,1 String represents the optimal representation of each character . One contains 100,000 Character file , The frequency of each character is different , Suppose a source is generated a,b,c,d and f
6 Kind of symbol , The frequency is shown in the table below , Unit: 1000 times .

As can be seen from the table a Characters appear most frequently ——45 Thousand times ,f Lowest frequency ——5 Thousand times .
To distinguish 6 Different characters require a length of 3 of 0-1 String representation , So it's natural to think that a Code as 000,b Code as 001, And so on . This non discriminatory coding is fixed length coding .

The advantage of fixed length coding is simple decoding , Because all encoding lengths are equal to 3, So when decoding , By length 3 decoding , Then it's right .
For example, I received 000 001 000 character string , By length 3 Separate disconnection , therefore 000 Decode as a,001 Decode as b, Decoding according to this rule will get the information we need .

<> Problem objectives

Given coded character set C And any character c Frequency of occurrence of f©, Average code length of fixed length coding .

<> Solving objectives

Find a coding scheme that minimizes the average code length .

Fixed length code

Average code length of fixed length code ：
（45+13+12+16+9+6）* 3 =300
So is the fixed length coding scheme the best coding for the character set ? no

stay 1951 year , Huffman is here MIT Information theory course , The term paper needs to be completed ： Find the most efficient binary encoding .
1952 year , According to Shannon （Shannon） stay 1948 Nian he fan Ruo (Fano) stay 1949 The coding idea described in puts forward a method of indefinite length coding , Also known as Huffman coding , Optimal coding .

In the fixed length code scheme , The difference of character frequency is not considered .

For the example just now , To the character set 6 Characters , Some people have given the following coding methods

There seems to be no problem . So can this coding scheme ensure the correctness of decoding ?
Suppose we receive a string of 0 1 0 1 1 1 0 1, Decode according to the coding scheme ：

The first character is 0, Decode as a, The second character after is 1, decoding b, still 0
hinder 1 0 Characters are decoded into characters together c ? still 101 Decode together as f And ? Obviously, this coding scheme can not guarantee correct decoding .
Reason: the first variable length code is 0, Direct decoding to a character , that is because 0 Not a prefix for the rest of the encoding , To the later code 1 0 1
Why can't you code it correctly ? We can see that in this coding scheme ,b Coding of 1 He is a character c code 1 0, character f code 1 0 1 Prefix of , Therefore, it leads to the diversity of decoding .
That means , In addition to considering the frequency of characters, variable length coding , We also need to consider that the encoding of each character can not make the prefix of other characters , Therefore, the concept of prefix code is introduced ：

Prefix code ：
Specify one for each character 0,1 String as its code , It is required that the code of any character is not the prefix of other character codes . This code is called prefix code . The prefix property of coding can make the decoding method very simple ; for example 001011101 Can be uniquely decomposed into 0,0,101,1101, Therefore, it is decoded as aabe.

Then the prefix code can ensure the certainty of decoding ? give an example
The encoding of the character set is as follows

The coding scheme in the figure above is the prefix code , The code of any character is not the prefix of other character codes , Next, decode the string 0 1 0 1 0 0 0 1
There are and only such unique decoding results

Average code length ：
The average code length of fixed length code is 300, Then for this variable length coding scheme, the average code length is ：451 +133 123 +163 +9*4 + 5 *4 =224
Obviously, the variable length coding scheme is smaller than the average code of fixed length coding , Compression rate reaches 25%

Here's the analysis , For a given character set , Huffman coding problem is transformed into ： Find the problem of prefix coding to minimize the average code length . How to find prefix code ?

For average code length 224 Coding scheme of , Organize it into a binary tree , In the following binary tree , have 6 Leaf nodes , In binary tree , It is customary to mark the road from the parent node to the left son as 0, The road to the right son is marked 1, Observe from the roots 100, Path to each leaf , To character a The path to is 0, reach b The path to is 1
0 1
It just corresponds to 6 Prefix code scheme of characters
Look at the tree , Out of the leaf node , Other nodes have 2 A son , Therefore, it is a binary complete tree , So here comes the question... , Why in a binary complete tree , Why is the path from tree root to leaf prefix code ?
Because it's a leaf , Therefore, the encoding of any one character , Are not prefixes of other character codes .

<> Huffman coding

Solving objectives ： Find the prefix code scheme that minimizes the average code length .

<> How to solve Huffman coding problem

greedy strategy ： Low frequency characters , Priority to join the team

step
·, Frequent by character f©, Non subtractive arrangement , Put in queue Q
2, From queue Q Select the two trees with the lowest frequency and merge them , Insert new frequency into queue Q Until you get n Binary complete tree of nodes .

The Huffman code is reduced to the solution idea of no guess . Two hours refers to the smallest one at present 2 Frequency , No guess means unconditionally greedy to choose them to merge

The main amount of calculation of the above algorithm is the sorting of installation frequency , Therefore, the time complexity is O(nlog
).

<> give an example

character set C={a,b,c,d,e} , F={7,16,12,9,8}
solution ： Sorted by frequency, the current queue is ：
7 8 9 12 16
Add to queue 7,8 The frequency of merging out of the team is 15 New node

Get new queue ：9 ,12, 15, 16
Then in the queue 9 and 12 The frequency of merging out of the team is 21 New node of

Merge in turn , Get the following binary tree
Next, find out the prefix code scheme corresponding to the binary complete binary tree

Character frequency
a100
b101
c00
d01
e11
Average code length B（T）=37 +38 +29 +212 +2*16 =119

explain
Road signs are customary ：0 Indicates a node to the left ,1 Indicates a node to the right
Of the same parent node 2 The order of nodes is arbitrary , Huffman code is not unique
The average code length of the optimal coding is consistent .

Less nonsense, less code
include <iostream> using namespace std; // Maximum character encoding array length #define MAXCODELEN 100
// Maximum number of Huffman node structure arrays #define MAXHAFF 100 // Maximum number of Huffman coded structure arrays #define MAXCODE 100 #
define MAXWEIGHT 10000; typedef struct Haffman { // weight int weight; // character char ch;
// Parent node int parent; // Left son node int leftChild; // Right son node int rightChild; }HaffmaNode;
typedef struct Code { // Storage of Huffman coding of characters int code[MAXCODELEN]; // Where do you start int start; }
HaffmaCode; HaffmaNode haffman[MAXHAFF]; HaffmaCode code[MAXCODE]; void
buildHaffman(int all) { // Work before initialization of Huffman node ,
weight by 0,parent,leftChile,rightChile All for -1 for (int i = 0; i < all * 2 - 1; ++i) {
haffman[i].weight = 0; haffman[i].parent = -1; haffman[i].leftChild = -1;
haffman[i].rightChild = -1; } std::cout << " Please enter the character and weight size that need Huffman encoding " << std::endl;
for (int i = 0; i < all; i++) { std::cout << " Please enter the " << i << " Huffman characters and weights " << std
::endl; std::cin >> haffman[i].ch; std::cin >> haffman[i].weight; }
// Find the node with the smallest weight each time , Generate a new node , need all - 1 Secondary merger int x1, x2, w1, w2; for (int i = 0; i < all
- 1; ++i) { x1 = x2 = -1; w1 = w2 = MAXWEIGHT; // Notice that every time here all + i It's convenient inside for (int j
= 0; j < all + i; ++j) { // Get the node with the lowest weight if (haffman[j].parent == -1 && haffman[j].
weight< w1) { // If every minimal update , Then you need to give the last smallest one to the second smallest one w2 = w1; x2 = x1; x1 = j; w1 =
haffman[j].weight; } // Used here else if Not if, Because they only choose at a time 1 Just one . else if(haffman[j].
parent== -1 && haffman[j].weight < w2) { x2 = j; w2 = haffman[j].weight; } }
// After finding the smallest two nodes at a time, remember to merge them into a new node haffman[all + i].leftChild = x1; haffman[all + i].
rightChild= x2; haffman[all + i].weight = w1 + w2; haffman[x1].parent = all + i;
haffman[x2].parent = all + i; std::cout << "x1 is" << x1 <<" x1 parent is"<<
haffman[x1].parent<< " x2 is" << x2 <<" x2 parent is "<< haffman[x2].parent<< "
new Node is " << all + i << "new weight is" << haffman[all + i].weight << std::
endl; } } // Print Huffman code for each character void printCode(int all) { // Save the character code of the current leaf node HaffmaCode
hCode; // Current parent node int curParent; // Number of subscript and leaf nodes int c; // Total number of Traversals for (int i = 0; i <
all; ++i) { hCode.start = all - 1; c = i; curParent = haffman[i].parent;
// The condition of traversal is that the parent node is not equal to -1 while (curParent != -1) { // Let's get the parent node first , Then judge whether the left node is the current value , If it is a node 0
// Otherwise, take the node 1, there c Will change , So don't use it i express , We use c Save current variable i if (haffman[curParent].leftChild == c) {
hCode.code[hCode.start] = 0; std::cout << "hCode.code[" << hCode.start << "] =
0" << std::endl; } else { hCode.code[hCode.start] = 1; std::cout <<
"hCode.code[" << hCode.start << "] = 1" << std::endl; } hCode.start--; c =
curParent; curParent = haffman[c].parent; } // Save the current leaf node information into the coding structure for (int j =
hCode.start + 1; j < all; ++j) { code[i].code[j] = hCode.code[j]; } code[i].
start= hCode.start; } } int main() { std::cout << " Please enter how many Huffman characters are there " << std::endl;
int all = 0; std::cin >> all; if (all <= 0) { std::cout << " The number you entered is incorrect " << std::
endl; return -1; } buildHaffman(all); printCode(all); for (int i = 0; i < all;
++i) { std::cout << haffman[i].ch << ": Haffman Code is:"; for (int j = code[i].
start+ 1; j < all; ++j) { std::cout << code[i].code[j]; } std::cout << std::endl
; } return 0; }

Technology
Daily Recommendation