
Feature selection : Chi-square


The \tilde{\chi}^2 test checks the independence of two events :

A : occurrence of the term
B : occurrence of the class

The higher the chi-square is, the stronger the evidence that the term and the class are dependent.

When classifying documents with a lot of different words, we can use the chi-square to keep only a subset of the words. Training becomes less expensive, and the classifier may also become more accurate because our dataset is denoised.

Definition


For a term t and a class c, we can compute :

\tilde{\chi}^2 = \sum\nolimits_{t \in \{0,1\}}\sum\nolimits_{c \in \{0,1\}}\dfrac{(N_{tc} -E_{tc})^2}{E_{tc}}

Which we can expand into :

\tilde{\chi}^2 = \dfrac{(N_{00}+N_{01}+N_{10}+N_{11})*(N_{00}N_{11}-N_{01}N_{10})^2}{(N_{00}+N_{01})*(N_{00}+N_{10})*(N_{01}+N_{11})*(N_{10}+N_{11})}

N : observed frequency
E : expected frequency (under the assumption of independence)
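As a sketch, the closed form above can be computed directly from the four observed counts. The function name and the counts used below are illustrative, not from the original:

```python
def chi_square(n00, n01, n10, n11):
    """Chi-square of a 2x2 term/class contingency table (closed form).

    n_tc is the number of docs with term-indicator t and class-indicator c.
    Assumes every row and column sum is non-zero.
    """
    n = n00 + n01 + n10 + n11
    numerator = n * (n00 * n11 - n01 * n10) ** 2
    denominator = (n00 + n01) * (n00 + n10) * (n01 + n11) * (n10 + n11)
    return numerator / denominator
```

For a perfectly balanced table, e.g. `chi_square(10, 10, 10, 10)`, the statistic is 0: the term carries no information about the class.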

Computation


N is observed : we get it by counting the occurrences of t and c in our documents

N is the total number of docs
N_{11} is the number of docs with the term t and of the class c
N_{10} is the number of docs with the term t and not of the class c
N_{01} is the number of docs without the term t and of the class c
N_{00} is the number of docs without the term t and not of the class c
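The counting above can be sketched as follows. The mini-corpus format (a set of words plus a class label per document) is a hypothetical choice for the example:

```python
def count_cells(docs, term, cls):
    """Count N00, N01, N10, N11 for one term and one class.

    `docs` is a list of (set_of_words, class_label) pairs.
    """
    n00 = n01 = n10 = n11 = 0
    for words, label in docs:
        has_term = term in words
        in_class = (label == cls)
        if has_term and in_class:
            n11 += 1          # term present, class matches
        elif has_term:
            n10 += 1          # term present, class differs
        elif in_class:
            n01 += 1          # term absent, class matches
        else:
            n00 += 1          # term absent, class differs
    return n00, n01, n10, n11
```

For instance, on four toy documents where "cheap" appears only in the "spam" class, `count_cells(docs, "cheap", "spam")` fills only the diagonal cells N00 and N11.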

E has to be computed

We will use this contingency table :

      c=0   c=1
t=0   N00   N01
t=1   N10   N11


To compute E_{11}, we sum its row (N_{10}+N_{11}), sum its column (N_{01}+N_{11}), multiply the two results, and divide by N, the total number of docs :

E_{11}=\dfrac{(N_{10}+N_{11})*(N_{01}+N_{11})}{N}

We do the same for the other E_{tc} :
E_{00}=\dfrac{(N_{00}+N_{01})*(N_{00}+N_{10})}{N}
E_{01}=\dfrac{(N_{00}+N_{01})*(N_{01}+N_{11})}{N}
E_{10}=\dfrac{(N_{10}+N_{11})*(N_{00}+N_{10})}{N}
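A small sketch of the four expected counts; as a sanity check, plugging them into the definitional sum of (N-E)^2/E gives the same value as the closed form. The function name is illustrative:

```python
def expected_counts(n00, n01, n10, n11):
    """Expected counts under independence: row sum * column sum / N."""
    n = n00 + n01 + n10 + n11
    e00 = (n00 + n01) * (n00 + n10) / n
    e01 = (n00 + n01) * (n01 + n11) / n
    e10 = (n10 + n11) * (n00 + n10) / n
    e11 = (n10 + n11) * (n01 + n11) / n
    return e00, e01, e10, e11
```

Note that the four expected counts always have the same row and column sums as the observed counts; only the dependence between t and c is removed.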

Confidence


We can consider a term not helpful to determine a class if they are independent (chi-square is low).
The following table indicates how confident we can be that the term and the class are dependent, according to the \tilde{\chi}^2 value.

Confidence   \tilde{\chi}^2
90%          2.71
95%          3.84
99%          6.63
99.5%        7.88
99.9%        10.83


I.e., if a term has a chi-square greater than 7.88, you can reject the hypothesis that the term and the class are independent (and thus keep the term) with less than a 0.5% chance of being wrong.
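Putting it all together, a hypothetical selection step keeps only the terms whose chi-square against the class exceeds the 99.5% critical value. The corpus format and function name are illustrative:

```python
CRITICAL_995 = 7.88  # chi-square critical value at 99.5% confidence

def select_terms(docs, cls, threshold=CRITICAL_995):
    """Keep the terms whose chi-square against `cls` exceeds `threshold`.

    `docs` is a list of (set_of_words, class_label) pairs.
    """
    vocab = sorted(set().union(*(words for words, _ in docs)))
    kept = []
    for term in vocab:
        n00 = n01 = n10 = n11 = 0
        for words, label in docs:
            if term in words:
                if label == cls: n11 += 1
                else:            n10 += 1
            else:
                if label == cls: n01 += 1
                else:            n00 += 1
        n = n00 + n01 + n10 + n11
        den = (n00 + n01) * (n00 + n10) * (n01 + n11) * (n10 + n11)
        # A zero denominator means a degenerate table (e.g. a term present
        # in every doc): the term carries no information, so chi-square is 0.
        chi2 = 0.0 if den == 0 else n * (n00 * n11 - n01 * n10) ** 2 / den
        if chi2 > threshold:
            kept.append(term)
    return kept
```

Note that a term perfectly anti-correlated with the class is also kept: chi-square measures dependence in either direction, and such a term is just as informative for the classifier.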