Document Tampering Detection

Current Findings

1. With only 274 patches to train on, a deep learning setup is unable to learn features for document tampering detection. I have tried multiple setups, such as different sampling rates, architectures, and patch sizes; see the Experiments section.
2. Data augmentation and domain adaptation, similar to Yashas' setup for image tampering, increase the performance of the model, but not enough for it to be usable for detection.
3. A combination of non-deep-learning methods (CMFD, JPEG artifacts and Splicebuster) seems to perform better than the current setup.
4. A better strategy for creating synthetic document tampering should boost performance.

Introduction

Document tampering detection is the task of finding the location of tampering in a document. Convolutional networks give good results for image tampering detection, but these networks require large amounts of training data. We therefore want to see whether data augmentation and domain adaptation can give better performance than previous approaches.

Find-it Dataset

We use the Find-it dataset [4], which is divided into 2 tasks: classification and detection.

  • Images from the dataset for Task 1: Classification

    For classification the dataset provides 500 (470 pristine and 30 tampered) images for training and 499 (469 pristine and 30 tampered) images for testing.

    Tampered images


    Non-tampered images

  • Images from the dataset for Task 2: Detection

    For detection it provides 100 images for training and 80 images for testing.

The table below shows the distribution of tampering types in the dataset.



  • CPI (copy and paste inside the document)
  • CPO (copy and paste from another document)
  • IMI (creation of a text box imitating the font)
  • CUT (deletion of one or more characters/words)
  • Other: drawing, copy and paste from web...

Method

The figure shows the setup for document tampering detection.

Text Detection

The first step is to detect the text in the image. For this we use the ctpn-textdetector [3] to extract text from the documents. To extract characters from the text, we use connected components.

Creating Patches

The input to the model is 64x64 patches. To extract patches from the image, we use the bounding boxes extracted in the previous step: 64x64 patches are created by cropping the region around the text, and these patches become the input to our model.
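
As a rough sketch of this step, assuming OpenCV, text boxes given as (x0, y0, x1, y1), and an Otsu binarisation to find character blobs (the helper name and these choices are illustrative, not the exact pipeline):

```python
import cv2
import numpy as np

PATCH_SIZE = 64

def extract_char_patches(image, text_boxes, patch_size=PATCH_SIZE):
    """Cut patch_size x patch_size crops centred on the connected
    components (characters) found inside each detected text box."""
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    patches = []
    for (x0, y0, x1, y1) in text_boxes:
        line = gray[y0:y1, x0:x1]
        # Binarise the text line so characters become foreground blobs.
        _, binary = cv2.threshold(line, 0, 255,
                                  cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
        n_labels, _, _, centroids = cv2.connectedComponentsWithStats(binary)
        for i in range(1, n_labels):            # label 0 is the background
            cx = int(centroids[i][0]) + x0      # back to full-image coordinates
            cy = int(centroids[i][1]) + y0
            half = patch_size // 2
            ys, xs = cy - half, cx - half
            if ys < 0 or xs < 0:
                continue                        # skip characters at the page border
            patch = image[ys:ys + patch_size, xs:xs + patch_size]
            if patch.shape[:2] == (patch_size, patch_size):
                patches.append(patch)
    return np.array(patches)
```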

Class Imbalance

A document generally contains a large amount of text, of which only a few regions are tampered, so the data becomes highly imbalanced. In the Find-it dataset we observed that, on average, only 3.0% of the patches extracted from a document were tampered.

Experiments

Changing Sampling Ratio

There are far fewer tampered patches than non-tampered patches, so the first experiment studies the effect of the sampling ratio between the two classes, with the loss function weighted accordingly. After creating patches as described above, results for 3 different sampling ratios are shown below.

Sampling ratio (tampered : non-tampered)    Validation patch accuracy    Validation F1-score
1:1                                         0.657                        0.603
1:5                                         0.680                        0.624
1:10                                        0.664                        0.608
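
A minimal sketch of how such a sampling ratio and class weighting could be set up, assuming a PyTorch data pipeline (the helper name, weight values, and batch size are illustrative):

```python
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

def make_balanced_loader(dataset, labels, tampered_weight=5.0, batch_size=128):
    """Sample tampered patches (label 1) with `tampered_weight` times the
    probability of pristine patches (label 0)."""
    weights = torch.tensor(
        [tampered_weight if y == 1 else 1.0 for y in labels],
        dtype=torch.double)
    sampler = WeightedRandomSampler(weights, num_samples=len(weights),
                                    replacement=True)
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)

# The loss can be re-weighted in the same spirit (pristine, tampered):
criterion = torch.nn.CrossEntropyLoss(weight=torch.tensor([1.0, 5.0]))
```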

Amount of data augmentation

Because there are so few tampered images, it is very hard to train a CNN architecture on this data. As shown by Yashas, data augmentation improves the performance of deep learning models on image tampering classification.

Synthetic Data creation

We create synthetic tampered images using 3 types of tampering:

  • Copy-paste (CPI)
  • Splicing (CPO)
  • In-painting (CUT)
  • Examples of the augmented images created:

    Copy-Paste

    Splicing

    Inpainting

The Find-it dataset provides 470 pristine images for classification. To create tampered images we first use the ctpn-textdetector [3] to extract text from the documents, and then use connected components to extract characters from the text. Finally, the different types of tampering are created by replacing text regions. In this way we create a total of 6000 synthetic images.
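
As an illustration of the copy-paste (CPI) case, here is a minimal sketch that copies one character region over another and records the tamper mask; the box format and helper name are assumptions, and splicing or in-painting would follow the same pattern with a second document or an in-painting routine:

```python
import random
import numpy as np

def synthesize_cpi(image, char_boxes):
    """Create a simple copy-paste-inside (CPI) forgery: copy one character
    box over another and return the image plus a binary tamper mask.
    char_boxes: list of (x, y, w, h) character bounding boxes."""
    tampered = image.copy()
    mask = np.zeros(image.shape[:2], dtype=np.uint8)
    src, dst = random.sample(char_boxes, 2)
    sx, sy, w, h = src
    dx, dy = dst[0], dst[1]
    if dy + h > image.shape[0] or dx + w > image.shape[1]:
        return tampered, mask                   # paste would fall off the page
    # Overwrite the destination character with the source character.
    tampered[dy:dy + h, dx:dx + w] = image[sy:sy + h, sx:sx + w]
    mask[dy:dy + h, dx:dx + w] = 255
    return tampered, mask
```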

Domain Adaptation

As seen in the figure below, the source model and the target model are trained simultaneously, and the weights of the CNN block are shared between them. The outputs are then passed through 2 fully connected layers to obtain a 256-dim representation for the source and target images. For domain adaptation, Yashas defines an MMD loss between the two representations:

L_{MMD} = \left\| \frac{1}{|X_s|} \sum_{x_s \in X_s} \Phi(x_s) - \frac{1}{|X_t|} \sum_{x_t \in X_t} \Phi(x_t) \right\|^2

where x_s and x_t are the features of the source and target images respectively, and Φ(x_s) and Φ(x_t) are computed by passing the representations through a Gaussian kernel feature map.
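
A minimal PyTorch sketch of such an MMD term using a single Gaussian kernel bandwidth (the bandwidth, and the use of one kernel rather than a mixture, are assumptions):

```python
import torch

def gaussian_kernel(a, b, sigma=1.0):
    """k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 * sigma^2)) for all pairs."""
    sq_dists = torch.cdist(a, b) ** 2
    return torch.exp(-sq_dists / (2.0 * sigma ** 2))

def mmd_loss(xs, xt, sigma=1.0):
    """Squared MMD between source and target 256-dim batch representations."""
    k_ss = gaussian_kernel(xs, xs, sigma).mean()
    k_tt = gaussian_kernel(xt, xt, sigma).mean()
    k_st = gaussian_kernel(xs, xt, sigma).mean()
    return k_ss + k_tt - 2.0 * k_st
```

During training this term would typically be added to the patch classification loss with a weighting factor (the weighting scheme is an assumption here).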

Number of synthetic patches    Source training F1-score    Source validation F1-score    Target training F1-score    Target validation F1-score
No synthetic data              -                           -                             0.680                       0.624
2000 patches                   0.968                       0.951                         0.962                       0.734
6000 patches                   0.980                       0.971                         0.973                       0.782

Loss & accuracy plots vs. epochs

  • Training & validation loss with and without using augmented data
  • Training & validation accuracy with and without using augmented data

    Validation accuracy on the Find-it images is shown in pink.

Changing CNN architecture

We further tested whether changing the architecture would improve performance. From the table below we can infer that the ResNet-style model performs slightly better than the VGG-style setup used by Yashas.

Number of convolution blocks    Source training F1-score    Source validation F1-score    Target training F1-score    Target validation F1-score
3 blocks                        0.902                       0.898                         0.87                        0.702
5 blocks (Yashas' setup)        0.973                       0.961                         0.953                       0.762
18 blocks (ResNet setup)        0.980                       0.971                         0.973                       0.782
We notice a marginal improvement by changing the architecture.
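
For reference, here is a minimal sketch of a ResNet-18-style patch classifier, assuming torchvision's resnet18 with the stem adapted to 64x64 inputs; the shared-weight source/target wiring and the 256-dim embedding head described above are omitted, so this is not the exact architecture used.

```python
import torch
import torch.nn as nn
from torchvision import models

def build_patch_classifier(num_classes=2):
    """ResNet-18 backbone for tampered/pristine 64x64 patch classification."""
    net = models.resnet18(weights=None)            # trained from scratch
    # 64x64 patches are small, so soften the aggressive initial downsampling.
    net.conv1 = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1, bias=False)
    net.maxpool = nn.Identity()
    net.fc = nn.Linear(net.fc.in_features, num_classes)
    return net

model = build_patch_classifier()
logits = model(torch.randn(8, 3, 64, 64))          # a batch of 8 patches
```
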
Testing

For testing we take the model with the highest average IoU over the validation data. We compare it with the setup used by Verdoliva, which combines 3 techniques: (a) CMFD for CPI, (b) Splicebuster for CPO, and (c) JPEG artifacts. We take the union of the 3 masks and compare it with the ground truth to evaluate against their results.
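
A small sketch of that evaluation, assuming the detectors output binary masks of the same size as the ground truth (the function name is illustrative):

```python
import numpy as np

def union_iou(masks, ground_truth):
    """IoU between the union of several binary detection masks
    (e.g. CMFD, Splicebuster, JPEG-artifact maps) and the ground-truth
    tamper mask."""
    union_pred = np.zeros(ground_truth.shape, dtype=bool)
    for m in masks:
        union_pred |= m.astype(bool)
    gt = ground_truth.astype(bool)
    intersection = np.logical_and(union_pred, gt).sum()
    union = np.logical_or(union_pred, gt).sum()
    return float(intersection) / union if union > 0 else 0.0
```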

Method                               Average IoU on test data
Without augmentation                 0.00138
With augmentation                    0.00178
CMFD + Noiseprint + JPEG artifacts   0.516

Qualitative Results on Test Set

For more results, go to this link.