Thursday, January 9, 2014

Compressed ZIP IO Streams using java

Java provides the java.util.zip package for data compression in a zip-compatible format. Classes in this package allow you to manipulate zip files and create zip archives. Java uses specialized input/output streams to read/write compressed data. These streams can be used to send data to any destination in a compressed format.

The java.util.zip package provides a ZipInputStream for reading compressed data, i.e., zip files, and a ZipOutputStream for writing data in a compressed format. Both of these IO streams are filter streams (they extend FilterInputStream and FilterOutputStream through InflaterInputStream and DeflaterOutputStream) and in turn extend java.io.InputStream and java.io.OutputStream in the hierarchy.
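To make the hierarchy concrete, here is a minimal sketch that checks these relationships at runtime. The class name ZipHierarchyDemo and its method names are only illustrative and are not part of the utility class discussed in this post.

```java
import java.io.FilterInputStream;
import java.io.FilterOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Illustrative class name; not part of the original post.
class ZipHierarchyDemo {

 // True if a ZipOutputStream can be used wherever a FilterOutputStream
 // or a plain OutputStream is expected.
 static boolean zipOutputIsFilterStream() {
  return FilterOutputStream.class.isAssignableFrom(ZipOutputStream.class)
    && OutputStream.class.isAssignableFrom(ZipOutputStream.class);
 }

 // True if a ZipInputStream can be used wherever a FilterInputStream
 // or a plain InputStream is expected.
 static boolean zipInputIsFilterStream() {
  return FilterInputStream.class.isAssignableFrom(ZipInputStream.class)
    && InputStream.class.isAssignableFrom(ZipInputStream.class);
 }

 public static void main(String[] args) {
  System.out.println("ZipOutputStream is a filter stream: " + zipOutputIsFilterStream());
  System.out.println("ZipInputStream is a filter stream: " + zipInputIsFilterStream());
 }
}
```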

This inheritance hierarchy makes it very easy to use both of these Zip IO streams as wrappers around any other IO stream. It would, for instance, be straightforward to write a custom RMI socket that transfers compressed data over the network. Here is the code to read data from a file, compress it, and send it to another server using a socket:

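 // Create an input stream for reading file data
 InputStream in = new FileInputStream(file);
 
 // Open a socket connection to the remote server
 Socket socket = new Socket(hostName, portNumber);
 
 // Get the socket's output stream to write data to
 OutputStream out = socket.getOutputStream();
 
 // Create a ZIP output stream to compress data
 ZipOutputStream zos = new ZipOutputStream(out);
 
 // A ZIP stream needs an entry before any data can be written;
 // the entry name "data" is only a placeholder
 zos.putNextEntry(new ZipEntry("data"));
 
 // Read the file and send it through the ZIP output stream (compressed)
 byte[] buffer = new byte[1024];
 int len;
 while ((len = in.read(buffer)) > 0) {
  zos.write(buffer, 0, len);
 }
 
 // Finish the entry and the archive, then release the resources
 zos.closeEntry();
 zos.close();
 in.close();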

Similarly, on the server side you need to wrap the input stream in a ZipInputStream to read the compressed data. This approach can significantly reduce network traffic when large amounts of data have to be sent to remote systems.
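As a rough sketch of the receiving side, the method below reads every entry from a ZIP-formatted stream and returns the uncompressed bytes keyed by entry name. The class name ZipReceiver and the method receive are hypothetical; on a real server the InputStream argument would typically come from socket.getInputStream(). Note that closing the ZipInputStream also closes the wrapped stream.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper; not part of the original post.
class ZipReceiver {

 // Reads all entries from a ZIP-formatted stream.
 // Returns a map of entry name -> uncompressed content, in archive order.
 static Map<String, byte[]> receive(InputStream in) throws IOException {
  Map<String, byte[]> entries = new LinkedHashMap<>();
  byte[] buffer = new byte[1024];

  // Wrap the incoming stream; closing zis also closes 'in'
  try (ZipInputStream zis = new ZipInputStream(in)) {
   ZipEntry ze;
   while ((ze = zis.getNextEntry()) != null) {
    ByteArrayOutputStream data = new ByteArrayOutputStream();
    int len;
    // read() returns -1 at the end of the current entry
    while ((len = zis.read(buffer)) > 0) {
     data.write(buffer, 0, len);
    }
    entries.put(ze.getName(), data.toByteArray());
   }
  }
  return entries;
 }
}
```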

To illustrate the end-to-end flow, I used file IO streams to demonstrate how the ZIP package is used. There is no reason, however, why you cannot replace java.io.FileInputStream and java.io.FileOutputStream with any other type of IO stream you may need for your project.

For the purpose of this post, I created a utility class ZipUtil, that has three methods:

  • public void createArchive: This method is used to create a new ZIP file containing all files in a given folder.
  • public void extractArchive: This method is used to extract all the files from a given ZIP archive.
  • private List<String> listFiles: This is a private method to generate a list of all files that can be compressed. This method is used internally by the createArchive() method.

Let's look at all of these methods in detail:

Creating ZIP Archive

First we shall see how we can create ZIP Archives using the classes provided in the java.util.zip package.

 public void createArchive(File sourceDir, File zipFile) throws Exception {
 
  String root = sourceDir.getAbsolutePath();
  
  // Needed for listFiles method
  int rootLen = root.length() + 1;
  List<String> files = listFiles(rootLen, root, zipFile.getName());
  
  // try-with-resources closes the streams even if an exception is thrown
  try (ZipOutputStream zos = new ZipOutputStream(new FileOutputStream(zipFile))) {
   for (String file : files) {
    // Create a new zip entry and add it to the archive
    zos.putNextEntry(new ZipEntry(file));
    
    // Open and read the file data to compress in zip format
    try (InputStream in = new FileInputStream(root + File.separator + file)) {
     int len;
     while ((len = in.read(buffer)) > 0) {
      zos.write(buffer, 0, len);
     }
    }
    zos.closeEntry();
   }
  }
 }

As you can see, it is a very simple method that reads a folder and creates a zip entry for every file in it. The only tricky bit is getting the list of all the files that are to be archived in the new zip file; I shall explain that later. Other than that, all we are doing here is reading data from each file in the folder and writing it to the ZipOutputStream, which compresses the data before storing it on disk in the given zip file.

Now let's see the listFiles method:

 private List<String> listFiles(int rootLen, String path, String zipFile) {

  List<String> files = new ArrayList<String>();
  
  File root = new File( path );
  File[] list = root.listFiles();

  // listFiles() returns null if the path is not a readable directory
  if (list == null) 
   return files;
  
  // traverse through each file in the directory
  for ( File f : list ) {
   if ( f.isDirectory() ) {
    
    // In case it is a directory then call recursively to 
    // get a list of all the files in the sub directory
    files.addAll(listFiles(rootLen, f.getAbsolutePath(), zipFile));
   }
   else {
    
    // Skip the zip file if it already exists anywhere in the directory tree,
    // otherwise the archive would try to read itself while being written
    if(f.getName().equals(zipFile))
     continue;
    
    // We need the relative paths in the archive, remove the parent directories. 
    files.add(f.getAbsolutePath().substring(rootLen));
   }
  }
  return files;
 }

It is a very simple method that builds the list of all the files in the given directory recursively. It gets the list of everything in a directory and goes through that list one by one: if the item is a file, it is added to the list to return; if it is a directory, the same method is called recursively on that subdirectory. That way, we get the list of all the files in the full directory tree.
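For comparison, the same recursive traversal can be sketched with the java.nio.file API. This assumes Java 8 or later, where Files.walk is available; the class name FileWalker and the method listRegularFiles are illustrative only and are not used by ZipUtil.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Illustrative alternative to the hand-written recursion; assumes Java 8+.
class FileWalker {

 // Returns every regular file below 'root', sorted by path.
 // Directories themselves are visited but not included in the result.
 static List<Path> listRegularFiles(Path root) throws IOException {
  // Files.walk holds open directory handles, so close the stream when done
  try (Stream<Path> paths = Files.walk(root)) {
   return paths.filter(Files::isRegularFile)
     .sorted()
     .collect(Collectors.toList());
  }
 }
}
```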

There are, however, a few minor details that we need to consider. Notice the two extra parameters to this method, rootLen and zipFile. Both of them are there for a special reason. The root length is the length of the root path of the parent folder provided to the createArchive() method. The class java.io.File only provides either the full path of a file or just the file name. Neither is usable for the archive, because we have to maintain the folder hierarchy within the zip archive, relative to the root folder path.

The workaround is to get the full path of each file and then remove the root path from it. This way we end up with the path of the files in sub folders, relative to the root path of the parent folder. Because String manipulation is comparatively expensive, we pass the length of the root path in the rootLen parameter and remove that number of characters from the absolute path of each file, leaving only the relative paths with the names.
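As an alternative to trimming the string by hand, the relative path can be derived with Path.relativize. The sketch below also converts the platform separator to a forward slash, which is the separator the ZIP format specifies for entry names. The class name EntryNames and the method entryName are hypothetical and not part of ZipUtil.

```java
import java.io.File;
import java.nio.file.Path;

// Hypothetical helper; not part of the original post.
class EntryNames {

 // Builds a ZIP entry name for 'file' relative to 'root'.
 // Both paths are expected to be of the same kind (both absolute or both relative).
 static String entryName(Path root, Path file) {
  String relative = root.relativize(file).toString();
  // ZIP entry names use '/' regardless of the operating system
  return relative.replace(File.separatorChar, '/');
 }
}
```

For example, with a root of C:\temp\zipTest\testFiles and a file first\TestText-1.txt below it, the entry name would become first/TestText-1.txt.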

Another important consideration is that if the zip file already exists (anywhere in the given directory tree), it would be picked up as one of the files to archive. We have already opened an OutputStream on that file, so we would end up opening an InputStream on the same file and reading the archive while it is being written, which can corrupt the archive or keep the loop from ever finishing. To avoid that problem we pass the name of the zip file in the zipFile parameter and leave it out of the returned list if it already exists.
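One side effect of matching on the file name alone is that any other file with the same name, for example an unrelated test.zip in a sub folder, is skipped as well. A stricter check could compare canonical paths instead. This is only a sketch; the class name ArchiveFilter and the method isArchive are hypothetical and not part of ZipUtil.

```java
import java.io.File;
import java.io.IOException;

// Hypothetical helper; not part of the original post.
class ArchiveFilter {

 // True only if 'candidate' points to the same location as the archive
 // being written, not merely to a file with the same name.
 static boolean isArchive(File candidate, File archive) throws IOException {
  // Canonical files resolve "." and ".." segments before comparing
  return candidate.getCanonicalFile().equals(archive.getCanonicalFile());
 }
}
```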

Extracting a ZIP Archive

We just saw how easy it is to create a ZIP Archive using the utility classes provided by the java.util.zip package. Now let's see how to extract files from a ZIP Archive to a folder.

 public void extractArchive(File archive, File outputDir) throws Exception {
 
  // Open the ZIP file; try-with-resources closes it when we are done
  try (ZipInputStream zis = new ZipInputStream(new FileInputStream(archive))) {
  
   // Start traversing the archive
   ZipEntry ze;
   while ((ze = zis.getNextEntry()) != null) {
   
    // Create a new file for each zip entry
    File f = new File(outputDir, ze.getName());
    
    // Directory entries carry no data, just create the folder
    if (ze.isDirectory()) {
     f.mkdirs();
     continue;
    }
    
    // Create all folders needed to store the file in the correct relative path
    f.getParentFile().mkdirs();
    
    try (OutputStream os = new FileOutputStream(f)) {
     int len;
     while ((len = zis.read(buffer)) > 0) {
      os.write(buffer, 0, len);
     }
    }
   }
  }
 }

This is the reverse of the createArchive() method we saw earlier. The compressed ZIP file must be opened using a ZipInputStream, and the data read from this stream can be written to any type of stream.

The only tricky bit here is the f.getParentFile().mkdirs(); call. It is needed because we must create subdirectories if there are any in the zip archive. Remember, a ZIP file may contain a full directory tree.
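One more thing to keep in mind when extracting archives that come from somewhere else: entry names are arbitrary strings and may contain ".." segments, so an entry could resolve to a location outside the output folder. A minimal sketch of a guard against that is shown below. The class name SafeExtract and the method resolveEntry are hypothetical and are not part of ZipUtil.

```java
import java.io.File;
import java.io.IOException;

// Hypothetical helper; not part of the original post.
class SafeExtract {

 // Resolves an entry name against the output folder and rejects
 // entries that would end up outside of it.
 static File resolveEntry(File outputDir, String entryName) throws IOException {
  File target = new File(outputDir, entryName);

  // Canonical paths have "." and ".." segments resolved
  String dirPath = outputDir.getCanonicalPath() + File.separator;
  if (!target.getCanonicalPath().startsWith(dirPath)) {
   throw new IOException("Entry is outside of the target directory: " + entryName);
  }
  return target;
 }
}
```

In extractArchive(), such a method could be used in place of new File(outputDir, ze.getName()).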

Testing the Example

To demonstrate how these utility methods work, I created a test class and a folder with some test data. We shall first create a zip file from that test data and then extract the zip file into a given folder.

Here is the test class:

package zip;

import java.io.File;

public class MyZipTest {

 private static final String ZIP_FILE = "C:/temp/zipTest/testFiles/test.zip";
 private static final String SOURCE_FOLDER = "C:/temp/zipTest/testFiles";
 private static final String OUTPUT_FOLDER = "C:/temp/zipTest/unzip";

 public static void main(String arg[]){

  File archive = new File(ZIP_FILE);
  File srcDir = new File(SOURCE_FOLDER);
  File destDir = new File(OUTPUT_FOLDER);

  ZipUtil tst = new ZipUtil();
  try{
   // To create a zip archive
   tst.createArchive(srcDir, archive);
   
   // To extract the zip archive
   tst.extractArchive(archive, destDir);
  }catch(Exception e){
   e.printStackTrace();
  }
 }
}

Here is the utility class:

package zip;

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ZipUtil {

 byte[] buffer;
 
 public ZipUtil(){
  buffer = new byte[1024];
 }
 
 /**
  * Utility method to create a zip archive of all the files in a given folder. 
  * The method takes the fully qualified name of the folder and the name 
  * of the zip file
  * 
  * @param sourceDir -- Directory to archive
  * @param zipFile -- Archive name
  * @throws Exception
  */
 public void createArchive(File sourceDir, File zipFile)throws Exception{

  String root = sourceDir.getAbsolutePath();
  
  if(!sourceDir.exists())
   throw new IOException("Source Directory " + root + " does not exist");
  
  if(!sourceDir.isDirectory())
   throw new IllegalArgumentException(root + " is not a valid directory");
  
  // Needed for listFiles method
  int rootLen = root.length() + 1;
  List<String> files = listFiles(rootLen, root, zipFile.getName());
  
  if (files.isEmpty())
   throw new IllegalArgumentException(root + " is empty");

  ZipOutputStream zos = new ZipOutputStream(new FileOutputStream(zipFile));
  
  System.out.println("Archiving: " + root);
  
  for (String file : files) {
   
   System.out.println("Ading: " + file);
   
   // Create a new zip entry and add it to the archive
   zos.putNextEntry(new ZipEntry(file));
   
   // Open and read the file data to compress in zip format
   InputStream in = new FileInputStream(root + File.separator + file);
   int len;
   while ((len = in.read(buffer)) > 0) {
    zos.write(buffer, 0, len);
   }
   in.close();
  }
  zos.close();
  System.out.println("Created: " + zipFile);
 }
 
 /**
  * This private method is used by the <code> public void createArchive(File sourceDir, File zipFile) </code> <br /> 
  * This method will generate the list of all the files in a given directory tree, with relative paths 
  * of the files in the sub directories if there are any. <br />
  * The third parameter is the name of the file that this method will ignore while generating the list. 
  * @param rootLen -- Length of the root directory path
  * @param path -- Directory to traverse
  * @param zipFile -- Name of the file to avoid 
  * @return List of file paths relative to the root directory
  */
 private List<String> listFiles(int rootLen, String path, String zipFile) {

  List<String> files = new ArrayList<String>();
  
  File root = new File( path );
  File[] list = root.listFiles();

  // listFiles() returns null if the path is not a readable directory
  if (list == null) 
   return files;
  
  // traverse through each file in the directory
  for ( File f : list ) {
   if ( f.isDirectory() ) {
    
    // In case it is a directory then call recursively to 
    // get a list of all the files in the sub directory
    files.addAll(listFiles(rootLen, f.getAbsolutePath(), zipFile));
   }
   else {
    
    // Skip the zip file if it already exists anywhere in the directory tree,
    // otherwise the archive would try to read itself while being written
    if(f.getName().equals(zipFile))
     continue;
    
    // We need the relative paths in the archive, remove the parent directories. 
    files.add(f.getAbsolutePath().substring(rootLen));
   }
  }
  return files;
 }
 
 /**
  * Utility method to extract all the files in a ZIP Archive to the given directory. 
  * If the ZIP file contains a complete directory tree, the method will take care of 
  * creating sub directories and placing files in appropriate locations within the 
  * sub directories. 
  * @param archive -- Zip Archive
  * @param outputDir -- Output directory for extracted files 
  * @throws Exception
  */
 public void extractArchive(File archive, File outputDir) throws Exception{
  
  String root = outputDir.getAbsolutePath();
  
  if(!outputDir.exists()){
   // mkdirs() also creates missing parent folders
   outputDir.mkdirs();
  }
  
  if(!outputDir.isDirectory())
   throw new IllegalArgumentException(root + " is not a valid directory");

  System.out.println("Processing: " + archive.getAbsolutePath());
  System.out.println("Extracting files to: " + root);
  
  // Open the ZIP file
  ZipInputStream zis = new ZipInputStream(new FileInputStream(archive));
  
  // Start traversing the Archive
  ZipEntry ze;
  while ((ze = zis.getNextEntry()) != null){
   
   System.out.println("Extracting: " + ze);
   
   // Create a new file for each zip entry
   File f = new File(outputDir, ze.getName());
   // Create all folders needed to store the file in the correct relative path.
   f.getParentFile().mkdirs();
   
   OutputStream os = new FileOutputStream(f);
   
   int len;
   while ((len = zis.read(buffer)) > 0) {
    os.write(buffer, 0, len);
   }
   os.close();
  }
  
  zis.close();
  
  System.out.println("Successfully Extracted: " + archive.getAbsolutePath());
 }
}

And this is the output when you run the test class to create a zip archive.

Archiving: C:\temp\zipTest\testFiles
Ading: first\TestText-1.txt
Ading: first\TestText-2.txt
Ading: first\TestText-3.txt
Ading: second\TestText-4.txt
Ading: second\TestText-5.txt
Ading: second\TestText-6.txt
Ading: second\TestText-7.txt
Ading: TestText-10.txt
Ading: TestText-8.txt
Ading: TestText-9.txt
Created: C:\temp\zipTest\testFiles\test.zip

And this is the output when you run the test class to extract a zip archive.

Processing: C:\temp\zipTest\testFiles\test.zip
Extracting files to: C:\temp\zipTest\unzip
Extracting: first\TestText-1.txt
Extracting: first\TestText-2.txt
Extracting: first\TestText-3.txt
Extracting: second\TestText-4.txt
Extracting: second\TestText-5.txt
Extracting: second\TestText-6.txt
Extracting: second\TestText-7.txt
Extracting: TestText-10.txt
Extracting: TestText-8.txt
Extracting: TestText-9.txt
Successfully Extracted: C:\temp\zipTest\testFiles\test.zip

As you can see, the technique can be used to compress any type of data stream. This may significantly reduce the bandwidth requirements of your system and can also dramatically improve performance.

However, there are a few pitfalls. You are adding some extra processing before sending data over any stream, and you also have to process the data on the other end before you can use it. This extra processing time may not be justifiable in many cases, and you may even end up slowing down your system rather than gaining any performance benefit.

Like every powerful tool, this must be approached with caution. This approach is desirable only when you are transferring large amounts of data in a single request and compression would make a difference. If you are sending small chunks of data then it may not give you any benefit or, worse, may even introduce more problems.
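A quick way to see this trade-off is to compare the size of a payload with the size of the ZIP archive that holds it. The sketch below does that in memory for a tiny payload and for a large, highly repetitive one. The repetitive payload is a best case for compression, so real data will usually compress less. The class name ZipOverheadDemo and the method zippedSize are illustrative only.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Illustrative measurement; not part of the original post.
class ZipOverheadDemo {

 // Returns the size in bytes of a single-entry ZIP archive holding 'data'.
 static int zippedSize(byte[] data) throws IOException {
  ByteArrayOutputStream bos = new ByteArrayOutputStream();
  try (ZipOutputStream zos = new ZipOutputStream(bos)) {
   zos.putNextEntry(new ZipEntry("payload"));
   zos.write(data);
   zos.closeEntry();
  }
  // The archive is complete once the ZIP stream has been closed
  return bos.size();
 }

 public static void main(String[] args) throws IOException {
  byte[] small = "hello".getBytes(StandardCharsets.US_ASCII);

  // 100,000 identical bytes: a best case for compression
  byte[] large = new byte[100_000];
  Arrays.fill(large, (byte) 'a');

  System.out.println("small: " + small.length + " bytes -> " + zippedSize(small) + " bytes zipped");
  System.out.println("large: " + large.length + " bytes -> " + zippedSize(large) + " bytes zipped");
 }
}
```

The small payload comes out larger than it went in, because the ZIP headers alone take more room than the data, while the large payload shrinks considerably.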