Diffstat (limited to 'content/posts/formatting-dumped-subtitles.md')
-rw-r--r-- content/posts/formatting-dumped-subtitles.md | 7
1 files changed, 5 insertions, 2 deletions
diff --git a/content/posts/formatting-dumped-subtitles.md b/content/posts/formatting-dumped-subtitles.md
index 4250b8d..1d00c4a 100644
--- a/content/posts/formatting-dumped-subtitles.md
+++ b/content/posts/formatting-dumped-subtitles.md
@@ -1,6 +1,7 @@
---
title: "Formatting dumped subtitles into a vocabulary list"
date: 2020-05-28T16:52:00
+lastmod: 2020-07-05T17:59:00
tags : [ "Formats", "Languages", "Linux", "Media", "Snippets", "Software", ]
---
@@ -11,7 +12,7 @@ tr ' ' '\n' < subs.srt \
sed -e 's/<[^>]*>//g' \
tr '[:upper:]' '[:lower:]' \
tr -d '\>\/!-.:?,.\",[:digit:]' \
- sed -e '/^[[:space:]]*$/d' -re 's/\s+$//' \
+ sed -e '/^[[:space:]]*$/d' -re 's/\s+$//' -re 's/\{...\}//' \
sort -u > subs-sort.srt
```
@@ -19,4 +20,6 @@ In short, this will break all spaces into new lines, remove HTML tags, make ever
One issue I've noticed is some _special_ characters won't be converted to lowercase &Aring; to &aring; for example. I don't have an automated workaround for you aside from specifying the letters individually for example using:
-<pre><code>tr '&AElig;&Oslash;&Aring;&Auml;&Ouml;&ETH;&THORN;&Aacute;&Eacute;&Iacute;&Oacute;&Uacute;&Yacute;' '&aelig;&oslash;&aring;&auml;&ouml;&eth;&thorn;&aacute;&eacute;&iacute;&oacute;&uacute;&yacute;'</pre></code>
\ No newline at end of file
+<pre><code>tr '&AElig;&Oslash;&Aring;&Auml;&Ouml;&ETH;&THORN;&Aacute;&Eacute;&Iacute;&Oacute;&Uacute;&Yacute;' '&aelig;&oslash;&aring;&auml;&ouml;&eth;&thorn;&aacute;&eacute;&iacute;&oacute;&uacute;&yacute;'</pre></code>
+
+* **Edit 2020-07-05:** Added {\an} tag removal
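
Note on the added `s/\{...\}//` expression: it matches exactly three characters between braces, so it only catches a positioning tag such as `{\an8}` once the digit has been stripped by the earlier `tr -d` step. The sketch below isolates that interaction on one sample line. The function name `strip_an_tags` and the sample input are hypothetical and not part of the post, and `-E` is used in place of the post's GNU-specific `-r` on the assumption that the local `sed` accepts it.

```shell
# Hypothetical helper, not from the post: reproduces only the two steps
# that matter for {\anN} tags (digit removal, then the brace expression).
strip_an_tags() {
  # tr -d removes the digit, turning {\an8} into {\an};
  # the sed expression then deletes any {xyz} group of three characters.
  tr -d '[:digit:]' | sed -E 's/\{...\}//'
}

# Sample input (assumed): a subtitle word carrying a positioning tag.
result=$(printf '%s\n' '{\an8}hello' | strip_an_tags)
printf '%s\n' "$result"   # prints "hello"
```

Because the expression has no `g` flag, only the first tag on each line is removed. That fits the pipeline in the post, which has already put each word on its own line by this point.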