1. With only 274 patches to train on, a deep-learning setup is unable to learn useful features for document tampering detection. We have tried multiple setups, such as different sampling rates, different architectures, and different patch sizes (see the Experiments section).
2. Data augmentation and domain adaptation similar to Yashas's setup for image tampering increase the performance of the model, but not enough to be usable for detection.
3. A combination of non-deep-learning methods (CMFD, JPEG artifacts, and Splicebuster) performs better than the current setup.
4. A better strategy for creating synthetic document tampering should boost performance.
Document tampering detection aims to locate tampered regions in a document. Convolutional networks give good results on image tampering detection, but these networks require large amounts of training data. We therefore want to see whether data augmentation and domain adaptation can yield better performance than previous approaches.
For detection we use the Find-it dataset [4]. The dataset is split into two tasks: classification and detection.
For detection it provides 100 images for training and 80 images for testing.
For classification, the dataset provides 500 (470 pristine and 30 tampered) images for training and 499 (469 pristine and 30 tampered) images for testing.
Figure: examples of tampered and non-tampered images from the dataset.
The table below shows the distribution of tampering types in the dataset.
The first step is to detect the text in the image. For this we use the ctpn-textdetector [3] to extract text regions from the documents, and we then use connected components to extract individual characters. The input to the model is 64x64 patches: using the previously extracted bounding boxes, we crop 64x64 regions around the text, and these patches become the input to our model.
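Below is a minimal sketch of this extraction step, assuming the CTPN boxes are already available as (x, y, w, h) tuples and using OpenCV connected components; the function name `extract_char_patches` is hypothetical and not the project's actual code.

```python
# Minimal sketch of the patch-extraction step, assuming the CTPN text
# detector has already produced text-line bounding boxes (x, y, w, h).
# Names here are illustrative, not the project's code.
import cv2
import numpy as np

PATCH = 64  # model input size

def extract_char_patches(gray_doc, text_boxes):
    """Crop 64x64 patches around each connected component (character)
    found inside the detected text-line boxes."""
    patches = []
    for (x, y, w, h) in text_boxes:
        line = gray_doc[y:y + h, x:x + w]
        # Binarise the text line so characters become foreground components.
        _, binary = cv2.threshold(line, 0, 255,
                                  cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
        n, _, _, centroids = cv2.connectedComponentsWithStats(binary)
        for i in range(1, n):  # label 0 is the background
            cx, cy = centroids[i]
            # Centre a 64x64 window on the character, clamped to the image.
            x0 = int(np.clip(x + cx - PATCH // 2, 0, gray_doc.shape[1] - PATCH))
            y0 = int(np.clip(y + cy - PATCH // 2, 0, gray_doc.shape[0] - PATCH))
            patches.append(gray_doc[y0:y0 + PATCH, x0:x0 + PATCH])
    return patches
```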
A document generally contains a large amount of text, of which only a few regions are tampered, so the data is highly imbalanced. In the Find-it dataset we observed that, on average, only 3.0% of the patches extracted from a document were tampered.
Since there are far fewer tampered patches than non-tampered ones, the first experiment studies the effect of the sampling ratio between the two classes, with the loss function weighted accordingly. After creating patches as described above, the results for three different sampling ratios are shown below (a sketch of the weighted-sampling setup follows the table).
Sampling ratio (tampered : non-tampered) | Validation patch accuracy | Validation F1-score |
---|---|---|
1:1 | 0.657 | 0.603 |
1:5 | 0.680 | 0.624 |
1:10 | 0.664 | 0.608 |
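The sketch below shows one way such a sampling ratio and a matching class weight could be set up in PyTorch; the dataset and label objects are placeholders, not the actual training code.

```python
# Hedged sketch of a 1:k tampered / non-tampered sampling setup in PyTorch.
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

def make_loader(dataset, labels, ratio=5, batch_size=128):
    """labels: 0 = non-tampered, 1 = tampered; ratio = k in a 1:k draw ratio."""
    labels = torch.as_tensor(labels)
    n_pos = int((labels == 1).sum())
    n_neg = int((labels == 0).sum())
    # Per-sample weights so expected draws are 1 tampered : k non-tampered.
    weights = torch.empty(len(labels), dtype=torch.double)
    weights[labels == 1] = 1.0 / n_pos
    weights[labels == 0] = ratio / n_neg
    sampler = WeightedRandomSampler(weights, num_samples=len(labels),
                                    replacement=True)
    # Up-weight the rarer tampered class in the loss to compensate the skew.
    criterion = torch.nn.CrossEntropyLoss(weight=torch.tensor([1.0, float(ratio)]))
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler), criterion
```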
We create synthetic tampered images using three types of tampering:
Copy-Paste
Splicing
Inpainting
The Find-it dataset provides 470 pristine images for classification. To create tampered images, we first use the ctpn-textdetector [3] to extract text from the documents and connected components to extract characters. The different types of tampering are then created by replacing text regions. In this way we create a total of 6000 synthetic images (a copy-paste example is sketched below).
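A minimal sketch of the copy-paste variant, assuming character boxes in (x, y, w, h) format; the helper `copy_paste_tamper` is hypothetical, and splicing or inpainting would differ only in where the replacement content comes from (another document, or an inpainting fill such as cv2.inpaint).

```python
# Hedged sketch of copy-paste synthetic tampering on a pristine document.
import random
import numpy as np

def copy_paste_tamper(doc, char_boxes):
    """doc: grayscale document image; char_boxes: (x, y, w, h) character boxes.
    Returns the tampered image and a binary ground-truth mask."""
    h_img, w_img = doc.shape[:2]
    tampered = doc.copy()
    mask = np.zeros((h_img, w_img), dtype=np.uint8)
    src, dst = random.sample(char_boxes, 2)
    sx, sy, sw, sh = src
    # Clamp the destination so the pasted region stays inside the image.
    dx = min(dst[0], w_img - sw)
    dy = min(dst[1], h_img - sh)
    tampered[dy:dy + sh, dx:dx + sw] = doc[sy:sy + sh, sx:sx + sw]
    mask[dy:dy + sh, dx:dx + sw] = 255
    return tampered, mask
```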
As seen in the figure below, the source model and the target model are trained simultaneously, and the weights of the CNN block are shared between them. The outputs are then passed through two fully connected layers to obtain a 256-dimensional representation for the source and target images. For domain adaptation, Yashas defines an MMD loss between the two representations:

$$\mathcal{L}_{\mathrm{MMD}} = \left\lVert \frac{1}{|X_s|}\sum_{x_s \in X_s} \Phi(x_s) - \frac{1}{|X_t|}\sum_{x_t \in X_t} \Phi(x_t) \right\rVert^2$$

where $x_s$ and $x_t$ are the features of the source and target images respectively, and $\Phi(x_s)$ and $\Phi(x_t)$ are computed by passing the representations through a Gaussian kernel.
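A minimal PyTorch sketch of such a Gaussian-kernel MMD loss is shown below; the multi-bandwidth kernel here is a common heuristic and not necessarily the exact kernel used in Yashas's setup.

```python
# Squared MMD between source and target feature batches with an RBF kernel.
import torch

def gaussian_kernel(a, b, sigmas=(1.0, 5.0, 10.0)):
    """Pairwise RBF kernel values between rows of a (n, d) and b (m, d)."""
    d2 = torch.cdist(a, b) ** 2  # squared Euclidean distances
    return sum(torch.exp(-d2 / (2.0 * s ** 2)) for s in sigmas)

def mmd_loss(source_feat, target_feat):
    """Biased estimator of the squared MMD between the two 256-dim batches."""
    k_ss = gaussian_kernel(source_feat, source_feat).mean()
    k_tt = gaussian_kernel(target_feat, target_feat).mean()
    k_st = gaussian_kernel(source_feat, target_feat).mean()
    return k_ss + k_tt - 2.0 * k_st
```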
Number of synthetic patches | Source training F1-score | Source validation F1-score | Target training F1-score | Target validation F1-score |
---|---|---|---|---|
No synthetic data | - | - | 0.680 | 0.624 |
2000 patches | 0.968 | 0.951 | 0.962 | 0.734 |
6000 patches | 0.980 | 0.971 | 0.973 | 0.782 |
Figure: training curves; validation accuracy on the Find-it images is shown in pink.
We further tested whether changing the architecture would improve performance. From the table below we can infer that ResNet-style models perform slightly better than the VGG-style setup used by Yashas (a sketch of the ResNet-18 patch classifier follows the table).
Number of convolution blocks | Source training F1-score | Source validation F1-score | Target training F1-score | Target validation F1-score |
---|---|---|---|---|
3 blocks | 0.902 | 0.898 | 0.87 | 0.702 |
5 blocks (Yashas's setup) | 0.973 | 0.961 | 0.953 | 0.762 |
18 blocks (ResNet-18 setup) | 0.980 | 0.971 | 0.973 | 0.782 |
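As an illustration of the ResNet-style variant, the sketch below adapts a torchvision ResNet-18 to 64x64 single-channel patches with two output classes; the stem modifications are assumptions, and the exact model used for the numbers above may differ.

```python
# Hedged sketch of a ResNet-18 patch classifier for 64x64 grayscale patches.
import torch
import torch.nn as nn
from torchvision.models import resnet18

def build_patch_classifier(in_channels=1, num_classes=2):
    model = resnet18(weights=None)  # train from scratch
    # Smaller stem: a 7x7/stride-2 conv plus max-pool loses too much
    # resolution on 64x64 inputs, so use a 3x3/stride-1 conv and drop the pool.
    model.conv1 = nn.Conv2d(in_channels, 64, kernel_size=3, stride=1,
                            padding=1, bias=False)
    model.maxpool = nn.Identity()
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model

if __name__ == "__main__":
    net = build_patch_classifier()
    print(net(torch.randn(8, 1, 64, 64)).shape)  # torch.Size([8, 2])
```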
For testing we take the model with the highest average IoU over the validation data. We compare it with the setup used by Verdoliva, which combines three techniques: (a) CMFD for CPI, (b) Splicebuster for CPO, and (c) JPEG artifact analysis. We take the union of all three masks and compare it with the ground truth to evaluate against their results (a sketch of this evaluation follows the table).
Method | Average IoU on test data |
---|---|
Without augmentation | 0.00138 |
With augmentation | 0.00178 |
CMFD + Noiseprint + JPEG artifacts | 0.516 |
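For reference, a small sketch of the mask-union and IoU evaluation described above; mask variable names are placeholders.

```python
# Union of per-method tampering masks and IoU against the ground truth.
import numpy as np

def union_mask(masks):
    """Pixel-wise OR of a list of binary masks with the same shape."""
    return np.clip(np.sum(masks, axis=0), 0, 1)

def iou(pred, gt):
    """Intersection-over-union between a binary prediction and ground truth."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    return np.logical_and(pred, gt).sum() / union if union > 0 else 0.0
```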
For more results, see this link.