Feature selection: Chi-square
\tilde{\chi}^2 tests the independence of two events:
A : occurrence of the term
B : occurrence of the class
The higher the chi-square, the more dependent the term and the class are (i.e. the more the term tells us about the class).
When classifying documents with many different words, we can use the chi-square to keep only a subset of the words. Training becomes less expensive, and the classifier may also become more accurate because the dataset is denoised.
Definition
For a term t and a class c, we can compute :
\tilde{\chi}^2 = \sum\nolimits_{t \in \{0,1\}}\sum\nolimits_{c \in \{0,1\}} \dfrac{(N_{tc} - E_{tc})^2}{E_{tc}}
which we can expand into:
\tilde{\chi}^2 = \dfrac{(N_{00}+N_{01}+N_{10}+N_{11})*(N_{00}N_{11}-N_{01}N_{10})^2}{(N_{00}+N_{01})*(N_{00}+N_{10})*(N_{01}+N_{11})*(N_{10}+N_{11})}
N : observed frequency
E : expected frequency under the independence hypothesis
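As a sketch, the expanded formula can be computed directly from the four observed counts (the function name and the example counts below are illustrative):

```python
def chi_square(n00, n01, n10, n11):
    """Chi-square statistic for a 2x2 term/class contingency table.

    n00: docs without the term, not in the class
    n01: docs without the term, in the class
    n10: docs with the term, not in the class
    n11: docs with the term, in the class
    """
    n = n00 + n01 + n10 + n11
    numerator = n * (n00 * n11 - n01 * n10) ** 2
    denominator = (n00 + n01) * (n00 + n10) * (n01 + n11) * (n10 + n11)
    return numerator / denominator

# The term appears mostly in documents of the class:
print(chi_square(6, 2, 1, 3))  # ≈ 2.74
```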
Computation
N is observed: we obtain it by counting the occurrences of t and c in our documents
N is the total number of docs: N = N_{00}+N_{01}+N_{10}+N_{11}
N_{10} is the number of docs with the term t but not of the class c
N_{11} is the number of docs with the term t and of the class c
N_{00} is the number of docs without the term t and not of the class c
N_{01} is the number of docs without the term t but of the class c
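A minimal sketch of how the N_{tc} can be counted, assuming each document is a set of words carrying a single class label (the function name and the toy corpus are hypothetical):

```python
def count_contingency(docs, labels, term, cls):
    """Count (N00, N01, N10, N11) for one term and one class.

    docs: list of word sets; labels: parallel list of class labels.
    """
    n = [[0, 0], [0, 0]]  # n[t][c]
    for words, label in zip(docs, labels):
        t = 1 if term in words else 0
        c = 1 if label == cls else 0
        n[t][c] += 1
    return n[0][0], n[0][1], n[1][0], n[1][1]

docs = [{"export", "trade"}, {"poultry", "export"}, {"poultry", "farm"}]
labels = ["trade", "trade", "farming"]
print(count_contingency(docs, labels, "poultry", "farming"))  # (1, 0, 1, 1)
```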
E has to be computed. We will use the following contingency table:

|     | c=0    | c=1    |
|-----|--------|--------|
| t=0 | N_{00} | N_{01} |
| t=1 | N_{10} | N_{11} |
To compute E_{11}, we sum its row (N_{10}+N_{11}), sum its column (N_{01}+N_{11}), multiply the two, and divide by N:
E_{11}=\dfrac{(N_{10}+N_{11})*(N_{01}+N_{11})}{N}
And we do the same for the other E_{tc}:
E_{00}=\dfrac{(N_{00}+N_{01})*(N_{00}+N_{10})}{N}
E_{01}=\dfrac{(N_{00}+N_{01})*(N_{01}+N_{11})}{N}
E_{10}=\dfrac{(N_{10}+N_{11})*(N_{00}+N_{10})}{N}
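The steps above can be sketched in Python: compute each E_{tc} from its row and column totals, then sum the (N_{tc}-E_{tc})^2/E_{tc} terms (assuming all row and column totals are non-zero; the function name is illustrative):

```python
def chi_square_from_expected(n00, n01, n10, n11):
    """Chi-square via the summation definition: sum of (N_tc - E_tc)^2 / E_tc."""
    n = n00 + n01 + n10 + n11
    row = [n00 + n01, n10 + n11]  # totals for t=0 and t=1
    col = [n00 + n10, n01 + n11]  # totals for c=0 and c=1
    observed = [[n00, n01], [n10, n11]]
    total = 0.0
    for t in (0, 1):
        for c in (0, 1):
            e = row[t] * col[c] / n  # expected count under independence
            total += (observed[t][c] - e) ** 2 / e
    return total

print(chi_square_from_expected(6, 2, 1, 3))  # ≈ 2.74, same as the expanded formula
```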
Confidence
We can consider a term not helpful to determine a class if they are independent (chi-square is low).
The following table indicates how confident we can be when rejecting the independence hypothesis (and therefore keeping a term) according to its \tilde{\chi}^2.
| Confidence | \tilde{\chi}^2 |
|------------|----------------|
| 90%        | 2.71           |
| 95%        | 3.84           |
| 99%        | 6.63           |
| 99.5%      | 7.88           |
| 99.9%      | 10.83          |
I.e., if a term has a chi-square greater than 7.88, you can reject the independence hypothesis (and keep the term) with less than a 0.5% chance of being wrong.
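As an illustration of using the table for selection, here is a sketch that keeps only the terms whose chi-square exceeds the 95% critical value (the threshold choice, the function names, and the example counts are all illustrative):

```python
CRITICAL_95 = 3.84  # chi-square critical value at 95% confidence (1 degree of freedom)

def chi_square(n00, n01, n10, n11):
    """Chi-square statistic for a 2x2 term/class contingency table."""
    n = n00 + n01 + n10 + n11
    return (n * (n00 * n11 - n01 * n10) ** 2
            / ((n00 + n01) * (n00 + n10) * (n01 + n11) * (n10 + n11)))

def select_terms(counts, threshold=CRITICAL_95):
    """Keep terms whose chi-square against the class exceeds the threshold.

    counts: dict mapping term -> (n00, n01, n10, n11).
    """
    return [t for t, ns in counts.items() if chi_square(*ns) > threshold]

counts = {
    "weak": (6, 2, 1, 3),    # chi-square ≈ 2.74, below the 95% critical value
    "strong": (5, 0, 0, 5),  # chi-square = 10.0, above it
}
print(select_terms(counts))  # ['strong']
```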