There are many articles about recovering space on your computer’s hard drive or storage, and they all cover the same ground. This article isn’t going to focus so much on that; instead, it will focus on the not-so-obvious things you can do.
As a rule of thumb, there are always the usual suspects that are easy to do for recovering space:
1. Clear your temporary Internet files (Browser Caches)
2. Search for lost clusters (run scandisk)
3. Check and repair your registry.
4. Archive and backup data you haven’t accessed in over a year.
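Step 4 above is easy to script. Here is a minimal sketch in Python (my own illustration, not a polished tool) that lists files whose last-access time is older than a year; note that many filesystems are mounted with `noatime`, in which case you may want to fall back to the modification time instead.

```python
import os
import time

def stale_files(root, days=365):
    """Yield (path, age_in_days) for files not accessed in `days` days.

    Caveat: on filesystems mounted with noatime, st_atime is not updated
    on reads; substitute st.st_mtime below if that applies to you.
    """
    cutoff = time.time() - days * 86400
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if st.st_atime < cutoff:
                yield path, (time.time() - st.st_atime) / 86400
```

Review the resulting list, then archive those files to external storage before deleting them.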
Let’s concentrate instead on the single most important thing you can do. Most people don’t realize it, but we keep multiple copies of the same file in multiple places on our computers. If you’re anything like me, good luck trying to find all of them by hand.
It’s called deduplication. It’s not exactly easy, because it’s incredibly time consuming. Many people don’t realize just how many copies of emails, documents, and so on they have on their hard drives, so they simply upgrade to a bigger hard drive and bring all that baggage along with them.
Well, there is a simple solution. It involves simply (or not so simply) iterating through every file on your hard drive and identifying the duplicates. Done by hand, this is a painful process that can take weeks. Even with my expertise, it still took 9 hours on an 8-core i7 with 16 GB of RAM and an SSD. While the process was slow, manual, and painful, it freed up over 150 GB of space on a 512 GB SSD. Was it worth it? I’d say yes.
So the question is how does one do this? It’s going to require a bit of work, but I’ll explain the concept to you and let you decide whether you’d like to undertake the process.
1. I first installed a SQL database on my computer. I actually booted into CentOS, mounted the filesystem, and then ran my queries against it.
2. Create a DB table with a few fields. The two most important are the absolute path and filename of each file, and the MD5 checksum we’ll generate for it.
3. Now, iterate through the entire mounted filesystem and generate the MD5 checksum for every file. If you’d like to get creative (as I did), also save each file’s last modified date. (More on that in a moment.)
4. After the iteration process, you’re going to have a ridiculously large dataset.
5. Sort the table data by MD5 checksum, then by last modified date in descending order.
6. You should see many files with identical MD5 checksums. Delete the older copies and keep only the newest (by last modified date).
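The steps above can be sketched in a few dozen lines of Python using the standard `hashlib` and `sqlite3` modules. This is my own minimal illustration of the idea, not the author’s actual tooling; the table and function names are made up for the example. Note that MD5 collisions between different files are theoretically possible, so a cautious version would also compare file sizes (or use SHA-256) before deleting anything.

```python
import hashlib
import os
import sqlite3

def md5_of(path, chunk_size=1 << 20):
    """MD5 checksum of a file, read in chunks so large files don't exhaust RAM."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def index_tree(db_path, root):
    """Steps 2-3: record absolute path, MD5, and last-modified time per file."""
    db = sqlite3.connect(db_path)
    db.execute("""CREATE TABLE IF NOT EXISTS files
                  (path TEXT PRIMARY KEY, md5 TEXT, mtime REAL)""")
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            path = os.path.abspath(os.path.join(dirpath, name))
            try:
                db.execute("INSERT OR REPLACE INTO files VALUES (?, ?, ?)",
                           (path, md5_of(path), os.path.getmtime(path)))
            except OSError:
                continue  # unreadable file; skip it
    db.commit()
    return db

def duplicates(db):
    """Steps 5-6: for every checksum seen more than once, list each copy
    that is older than the newest one -- i.e. the deletion candidates.
    (Copies sharing the exact same mtime are all kept, which errs safe.)"""
    rows = db.execute("""
        SELECT path FROM files AS f
        WHERE EXISTS (SELECT 1 FROM files AS g
                      WHERE g.md5 = f.md5 AND g.path != f.path)
          AND mtime < (SELECT MAX(mtime) FROM files AS g WHERE g.md5 = f.md5)
        ORDER BY md5, mtime DESC""")
    return [r[0] for r in rows]
```

A sensible workflow is to run `duplicates()` as a dry run, eyeball the list, and only then pass the survivors to `os.remove()`.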
Once you’ve completed this process, you’re done. You now have only one copy of each file on your computer.
If I get enough donations, I’ll make this a Java Application and post it for free so it can run on any platform.
It wasn’t an easy or fun task, but going from 500+ GB of data down to 350-ish GB of data without losing anything is a pretty impressive way to go!
I can do this for you if you would like, but it would be purely on a consulting basis. Or you can just donate; when I get enough donations to cover the time to write the application, I’ll do it and have it out there for anyone to download and use for free.
That is a promise!
Hope this little ditty helps those that actually want to recover a significant amount of space.
The next step is to do this on my Pegasus Array which has over 5 TB of data. (Yay!!!)
I should also mention that your computer’s performance will go through the roof! A lot less I/O overhead dealing with stuff you never needed anyway.