Simple migration of existing posts to hugo format

author: breadcat 2020-06-19 12:23:15 +0100
committer: breadcat 2020-06-19 12:23:15 +0100
commit: 70bb5d5a801428b0fb390abf79f19ffcf5e29c67 (patch)
tree: b9fd7990156bd58bc38d58f91829c05933215102 /content/posts/formatting-dumped-subtitles.md
parent: 0f9a31348079c0a061bcc194912e75cc1c07bc1f (diff)
download: blog.minskio.co.uk-70bb5d5a801428b0fb390abf79f19ffcf5e29c67.tar.gz
blog.minskio.co.uk-70bb5d5a801428b0fb390abf79f19ffcf5e29c67.tar.bz2
blog.minskio.co.uk-70bb5d5a801428b0fb390abf79f19ffcf5e29c67.zip
1 files changed, 22 insertions, 0 deletions
diff --git a/content/posts/formatting-dumped-subtitles.md b/content/posts/formatting-dumped-subtitles.md
new file mode 100644
index 0000000..509061b
--- /dev/null
+++ b/content/posts/formatting-dumped-subtitles.md
@@ -0,0 +1,22 @@
+---
+title: "Formatting dumped subtitles into a vocabulary list"
+date: 2020-05-28T16:52:00
+tags : [ "formats", "languages", "linux", "media", "snippets", "software", ]
+---
+
+As per my previous post, you should now have a single `srt` subtitle file. To convert this into a single word list that you can begin translating, run the verbose script below.
+
+```
+tr ' ' '\n' < subs.srt | \
+	sed -e 's/<[^>]*>//g' | \
+	tr '[:upper:]' '[:lower:]' | \
+	tr -d '\>\/!-.:?,.\",[:digit:]' | \
+	sed -e '/^[[:space:]]*$/d' -re 's/\s+$//' | \
+	sort -u > subs-sort.srt
+```
+
+In short, this replaces every space with a newline, strips HTML tags, makes everything lowercase, deletes punctuation and digits, drops empty lines, then finally sorts the list while removing duplicates.
+
+One issue I've noticed is that some _special_ characters won't be converted to lowercase, &Aring; to &aring; for example. I don't have an automated workaround for you, aside from specifying the letters individually, for example using:
+
+<pre><code>tr '&AElig;&Oslash;&Aring;&Auml;&Ouml;&ETH;&THORN;&Aacute;&Eacute;&Iacute;&Oacute;&Uacute;&Yacute;' '&aelig;&oslash;&aring;&auml;&ouml;&eth;&thorn;&aacute;&eacute;&iacute;&oacute;&uacute;&yacute;'</code></pre>
\ No newline at end of file
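To see what each stage of the pipeline in the post does, here is a minimal sketch that runs the same steps over a tiny inline sample instead of `subs.srt`. The sample text and the `sample` and `wordlist` variable names are made up for illustration. The `tr -d` set is written without backslash escapes and the whitespace clean-up uses `-E` with `[[:space:]]`, which should be equivalent to the original on GNU tools but is an assumption rather than the author's exact command.

```shell
# Hypothetical three-line subtitle cue: index, timestamp, text with an HTML tag.
sample='1
00:00:01,000 --> 00:00:02,500
<i>Hello</i> there, Hello World!
'

wordlist=$(printf '%s' "$sample" |
	tr ' ' '\n' |                                    # one token per line
	sed -e 's/<[^>]*>//g' |                          # strip HTML tags such as <i>...</i>
	tr '[:upper:]' '[:lower:]' |                     # lowercase (ASCII letters here)
	tr -d '>/!-.:?[:digit:]' |                       # '!-.' is a range: ! " # $ % & ' ( ) * + , - .
	sed -E -e 's/[[:space:]]+$//' -e '/^$/d' |       # trim trailing whitespace, drop empty lines
	LC_ALL=C sort -u)                                # sort and remove duplicates

printf '%s\n' "$wordlist"
```

The index line and both timestamps consist only of digits and deleted punctuation, so they collapse to empty lines and are dropped. Note that the `!-.` range also removes apostrophes and hyphens, so a word like `don't` would come out as `dont`.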