author | breadcat | 2020-06-19 12:23:15 +0100 |
---|---|---|
committer | breadcat | 2020-06-19 12:23:15 +0100 |
commit | 70bb5d5a801428b0fb390abf79f19ffcf5e29c67 (patch) | |
tree | b9fd7990156bd58bc38d58f91829c05933215102 /content/posts/formatting-dumped-subtitles.md | |
parent | 0f9a31348079c0a061bcc194912e75cc1c07bc1f (diff) | |
Simple migration of existing posts to hugo format
Diffstat (limited to 'content/posts/formatting-dumped-subtitles.md')
-rw-r--r-- | content/posts/formatting-dumped-subtitles.md | 22 |
1 files changed, 22 insertions, 0 deletions
diff --git a/content/posts/formatting-dumped-subtitles.md b/content/posts/formatting-dumped-subtitles.md
new file mode 100644
index 0000000..509061b
--- /dev/null
+++ b/content/posts/formatting-dumped-subtitles.md
@@ -0,0 +1,22 @@
+---
+title: "Formatting dumped subtitles into a vocabulary list"
+date: 2020-05-28T16:52:00
+tags : [ "formats", "languages", "linux", "media", "snippets", "software", ]
+---
+
+As per my previous post, you should now have a single `srt` subtitle file. To convert it into a word list that you can begin translating, run the verbose script below.
+
+```
+tr ' ' '\n' < subs.srt | \
+ sed -e 's/<[^>]*>//g' | \
+ tr '[:upper:]' '[:lower:]' | \
+ tr -d '\>\/!-.:?,.\",[:digit:]' | \
+ sed -e '/^[[:space:]]*$/d' -re 's/\s+$//' | \
+ sort -u > subs-sort.srt
+```
+
+In short, this splits the text on spaces into one word per line, removes HTML tags, makes everything lowercase, strips digits, punctuation and empty lines, then sorts the list while removing duplicates.
+
+One issue I've noticed is that some _special_ characters won't be converted to lowercase (Å to å, for example). I don't have an automated workaround for you aside from specifying the letters individually, for example:
+
+<pre><code>tr 'ÆØÅÄÖÐÞÁÉÍÓÚÝ' 'æøåäöðþáéíóúý'</code></pre>
\ No newline at end of file
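
The pipeline added in this post can also be sketched as a small reusable function, which makes it easier to try on a few sample lines before running it on a whole file. This is a sketch under assumptions, not the post's own script: the name `srt_to_words` is made up for illustration, `[:punct:]` is swapped in for the hand-written character list (so it strips somewhat more punctuation), and `awk`'s `tolower` replaces `tr '[:upper:]' '[:lower:]'`. With gawk in a UTF-8 locale, `tolower` should also handle letters such as Å, but that depends on the awk implementation and locale in use, so treat it as something to verify on your own system.

```shell
#!/bin/sh
# Hypothetical helper (name not from the post): reads subtitle text on
# stdin and writes a sorted, de-duplicated, lowercase word list to stdout.
srt_to_words() {
    # 1. put each space-separated word on its own line
    # 2. strip HTML-style tags such as <i>...</i>
    # 3. lowercase (assumption: gawk + UTF-8 locale for non-ASCII letters)
    # 4. delete digits and punctuation (simplified from the post's list)
    # 5. trim trailing whitespace (including \r) and drop empty lines
    # 6. sort and remove duplicates
    tr ' ' '\n' \
        | sed -e 's/<[^>]*>//g' \
        | awk '{ print tolower($0) }' \
        | tr -d '[:digit:][:punct:]' \
        | sed -e 's/[[:space:]]*$//' -e '/^$/d' \
        | sort -u
}

# Usage, mirroring the file names used in the post:
#   srt_to_words < subs.srt > subs-sort.srt
```

Because the function reads standard input, a single cue can be piped in with `printf` to check what each stage removes.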