Feature selection: Chi-square
\tilde{\chi}^2 tests the independence of two events:
A : occurrence of the term
B : occurrence of the class
The higher the chi-square, the more dependent the term and the class are (i.e. the more the term tells us about the class).
When classifying documents with many different words, we can use the chi-square to keep only a subset of the words. Training becomes less expensive, and the classifier may also become more accurate because the dataset is denoised.
Definition
For a term t and a class c, we can compute :
\tilde{\chi}^2 = \sum\nolimits_{t \in \{0,1\}}\sum\nolimits_{c \in \{0,1\}} \dfrac{(N_{tc} - E_{tc})^2}{E_{tc}}
which we can expand into:
\tilde{\chi}^2 = \dfrac{(N_{00}+N_{01}+N_{10}+N_{11})*(N_{00}N_{11}-N_{01}N_{10})^2}{(N_{00}+N_{01})*(N_{00}+N_{10})*(N_{01}+N_{11})*(N_{10}+N_{11})}
N : observed frequency
E : expected frequency under the independence hypothesis
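As a sketch, the expanded formula can be computed directly from the four observed counts (the function name and the example counts below are illustrative):

```python
def chi_square(n00, n01, n10, n11):
    """Chi-square statistic for a 2x2 term/class contingency table.

    n00: docs without the term, not in the class
    n01: docs without the term, in the class
    n10: docs with the term, not in the class
    n11: docs with the term, in the class
    """
    n = n00 + n01 + n10 + n11
    numerator = n * (n00 * n11 - n01 * n10) ** 2
    denominator = (n00 + n01) * (n00 + n10) * (n01 + n11) * (n10 + n11)
    return numerator / denominator

# The term appears mostly in documents of the class:
print(chi_square(6, 2, 1, 3))  # ≈ 2.74
```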
Computation
N is observed: we obtain it by counting the occurrences of t and c in our documents
N is the total number of docs: N = N_{00}+N_{01}+N_{10}+N_{11}
N_{10} is the number of docs with the term t but not of the class c
N_{11} is the number of docs with the term t and of the class c
N_{00} is the number of docs without the term t and not of the class c
N_{01} is the number of docs without the term t but of the class c
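A minimal sketch of how the N_{tc} can be counted, assuming each document is a set of words carrying a single class label (the function name and the toy corpus are hypothetical):

```python
def count_contingency(docs, labels, term, cls):
    """Count (N00, N01, N10, N11) for one term and one class.

    docs: list of word sets; labels: parallel list of class labels.
    """
    n = [[0, 0], [0, 0]]  # n[t][c]
    for words, label in zip(docs, labels):
        t = 1 if term in words else 0
        c = 1 if label == cls else 0
        n[t][c] += 1
    return n[0][0], n[0][1], n[1][0], n[1][1]

docs = [{"export", "trade"}, {"poultry", "export"}, {"poultry", "farm"}]
labels = ["trade", "trade", "farming"]
print(count_contingency(docs, labels, "poultry", "farming"))  # (1, 0, 1, 1)
```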
E has to be computed. We will use the following contingency table:

|     | c=0    | c=1    |
|-----|--------|--------|
| t=0 | N_{00} | N_{01} |
| t=1 | N_{10} | N_{11} |
To compute E_{11}, we sum its row (N_{10}+N_{11}), sum its column (N_{01}+N_{11}), multiply the two, and divide by N:
E_{11}=\dfrac{(N_{10}+N_{11})*(N_{01}+N_{11})}{N}
And we do the same for the other E_{tc}:
E_{00}=\dfrac{(N_{00}+N_{01})*(N_{00}+N_{10})}{N}
E_{01}=\dfrac{(N_{00}+N_{01})*(N_{01}+N_{11})}{N}
E_{10}=\dfrac{(N_{10}+N_{11})*(N_{00}+N_{10})}{N}
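The steps above can be sketched in Python: compute each E_{tc} from its row and column totals, then sum the (N_{tc}-E_{tc})^2/E_{tc} terms (assuming all row and column totals are non-zero; the function name is illustrative):

```python
def chi_square_from_expected(n00, n01, n10, n11):
    """Chi-square via the summation definition: sum of (N_tc - E_tc)^2 / E_tc."""
    n = n00 + n01 + n10 + n11
    row = [n00 + n01, n10 + n11]  # totals for t=0 and t=1
    col = [n00 + n10, n01 + n11]  # totals for c=0 and c=1
    observed = [[n00, n01], [n10, n11]]
    total = 0.0
    for t in (0, 1):
        for c in (0, 1):
            e = row[t] * col[c] / n  # expected count under independence
            total += (observed[t][c] - e) ** 2 / e
    return total

print(chi_square_from_expected(6, 2, 1, 3))  # ≈ 2.74, same as the expanded formula
```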
Confidence
We can consider a term not helpful to determine a class if they are independent (chi-square is low).
The following table indicates how confident we can be when rejecting the independence hypothesis (and therefore keeping a term) according to its \tilde{\chi}^2.
| Confidence | \tilde{\chi}^2 |
|------------|----------------|
| 90%        | 2.71           |
| 95%        | 3.84           |
| 99%        | 6.63           |
| 99.5%      | 7.88           |
| 99.9%      | 10.83          |
I.e., if a term has a chi-square greater than 7.88, you can reject the independence hypothesis (and keep the term) with less than a 0.5% chance of being wrong.
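As an illustration of using the table for selection, here is a sketch that keeps only the terms whose chi-square exceeds the 95% critical value (the threshold choice, the function names, and the example counts are all illustrative):

```python
CRITICAL_95 = 3.84  # chi-square critical value at 95% confidence (1 degree of freedom)

def chi_square(n00, n01, n10, n11):
    """Chi-square statistic for a 2x2 term/class contingency table."""
    n = n00 + n01 + n10 + n11
    return (n * (n00 * n11 - n01 * n10) ** 2
            / ((n00 + n01) * (n00 + n10) * (n01 + n11) * (n10 + n11)))

def select_terms(counts, threshold=CRITICAL_95):
    """Keep terms whose chi-square against the class exceeds the threshold.

    counts: dict mapping term -> (n00, n01, n10, n11).
    """
    return [t for t, ns in counts.items() if chi_square(*ns) > threshold]

counts = {
    "weak": (6, 2, 1, 3),    # chi-square ≈ 2.74, below the 95% critical value
    "strong": (5, 0, 0, 5),  # chi-square = 10.0, above it
}
print(select_terms(counts))  # ['strong']
```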