summaryrefslogtreecommitdiffstats
path: root/content/posts/notes-on-sorting-paperwork.md
diff options
context:
space:
mode:
Diffstat (limited to 'content/posts/notes-on-sorting-paperwork.md')
-rw-r--r--content/posts/notes-on-sorting-paperwork.md141
1 files changed, 141 insertions, 0 deletions
diff --git a/content/posts/notes-on-sorting-paperwork.md b/content/posts/notes-on-sorting-paperwork.md
new file mode 100644
index 0000000..7265fc5
--- /dev/null
+++ b/content/posts/notes-on-sorting-paperwork.md
@@ -0,0 +1,141 @@
+---
+title: "Notes on Sorting Paperwork"
+date: 2020-06-23T15:29:00
+tags: ["Formats", "Guides", "Minimalism", "Snippets", "Software", "Windows"]
+---
+
+# Intention
+
+The purpose of this whole project ties in with digitally having copies of things. This makes searching and sorting much, much easier.
+
+The only physical paperwork I keep are certain important documents (like birth certificates, physical driving licence, etc).
+
+# Hardware
+
+I'm using a Fujitsu ScanSnap iX500 as my scanner of choice, it was recommended by the Paperless project and it folds away neatly when not in use.
+
+I initially intended on this hardware being used wirelessly, which I'm sure it is capable of; but for the once a month or so I go through my paperwork I doubt it's worth the hassle setting it up.
+
+# Software (or the lack of)
+
+*Preface:* I'm doing all of this on Windows. I've attempted to get this working nicely on Linux, but my results scanning with `sane` were terrible, even when tweaking input it was slower, with worse quality, wouldn't feed multiple sheets and had to be manually initiated from the command line for every scan.
+
+So, on with the software. The only real software that's required is the [ScanSnap Manager from Fujitsu](http://scansnap.fujitsu.com/global/dl/). It's a little bulky but for it's girth it will (attempt to) auto-rotate and OCR the document with pretty high rate of success. Only around 5-10% of documents require any manual intervention in my experience.
+
+One point worth noting is the ScanSnap software will add two autorun entries to your system startup. They can be removed with the following commands:
+```
+reg delete "HKEY_LOCAL_MACHINE\SOFTWARE\WOW6432Node\Microsoft\Windows\CurrentVersion\Run" /v "ScanSnap WIA Service Checker" /f
+reg delete "HKEY_LOCAL_MACHINE\SOFTWARE\WOW6432Node\Microsoft\Windows\CurrentVersion\Run" /v "ScanSnap OnlineUpdate Watcher" /f
+```
+
+In the past, I used [Paperless](https://github.com/the-paperless-project/paperless) to automatically OCR and pseudo-sort my documents, but this had a high failure rate (~75% of documents couldn't be parsed), was quite _heavy-weight_ software and was complete overkill for what I needed.
+
+Nowadays, I just scan the files to a single PDF document and organise away in a folder structure. Files are generally sorted well enough that I don't need to ever manually search for file contents as I always have an idea of what I need. Worst case scenario I believe you can grep strings from OCR'd pdf files anyway.
+
+# Naming
+
+When it comes to naming my files, I follow a few rules to help keep things organised:
+
+- Include a brief summary in the filename (e.g. insurance-renewal-invitation)
+- Always use lower case characters
+- Replace spaces with hyphens
+- Use ISO dates where possible, or if you receive a biannual statement, name it 2019a and 2019b
+- For cars, reference the registration number
+
+# Duplicates
+
+If you've read my [notes on sorting pictures]() post you'll know that I try to remove duplicates, however with paperwork I don't bother de-duplicate the files for a few reasons:
+
+- Time/bandwidth taken to download files during duplicate search
+- Scans vs identical scans vs digital copies will never be identical
+- OCR isn't perfect and will produce different results base on the above
+- My files are generally sorted well enough that I'd be getting duplicate filenames.
+
+# File manipulation
+
+Once scanned, I rename all my files to their titles are applicable (see above), then I manually check each one for blank, rotated and out of order pages. I upload these to my server and then manually sort any files that require attention using `qpdf`. Below are some basic commands to work with documents:
+
+## Deleting
+```
+qpdf input.pdf --pages input.pdf 1-9,26 -- outputfile.pdf
+```
+
+## Splitting
+```
+qpdf input.pdf --pages input.pdf 1-2,4 -- outputfile1.pdf
+qpdf input.pdf --pages input.pdf 3,5-6 -- outputfile2.pdf
+```
+
+## Rotating
+```
+qpdf in out.pdf --rotate=180:1,4
+```
+
+## Decrypting
+```
+qpdf --password=yourpassword --decrypt input.pdf outputfile.pdf
+```
+
+## OCR
+For files that I didn't scan myself, sometimes they will require manual OCR'ing. For this I use [OCRmyPDF](https://github.com/jbarlow83/OCRmyPDF):
+```
+ocrmypdf -l eng input.pdf output_ocr.pdf
+```
+
+## Lower case filenames
+```
+mmv '*' '#l1'
+```
+
+# Folder Structure
+
+Lastly, and the key to keeping things easy to work with, my folder structure looks something like this:
+
+```
+.
+|-- computing
+| `-- provider
+|-- finances
+| `-- bank-name
+| |-- correspondance
+| `-- statements
+|-- household
+| |-- appliances
+| | `-- appliance
+| |-- council
+| |-- insurance
+| | `-- year provider
+| |-- purchase
+| |-- recycling
+| |-- renovation
+| |-- utilities
+| | `-- sorted by provider
+| `-- voting
+|-- personal
+| |-- business-cards
+| |-- certifications
+| |-- driving-licence
+| |-- gym
+| |-- travel
+| | `-- year location
+| `-- workplace
+| |-- contracts
+| |-- interviews
+| |-- p60
+| |-- pension
+| |-- tax
+| `-- wages
+`-- transport
+ |-- insurance
+ | `-- year registration
+ |-- mot
+ | `-- year registration
+ |-- purchases
+ |-- repairs
+ | `-- registration
+ |-- roadside-assistance
+ | `-- year provider
+ `-- road-tax
+```
+
+This is all kept backed up on cloud storage, as you'd expect. So finally, at the end of all this, we have sorted, digital-only paperwork, backed up online. \ No newline at end of file