Add option for automatic subtitle character encoding normalization (#68)

* Add option for automatic subtitle character encoding normalization

The rationale behind this option is that some services use ISO-8859-1
(latin1) or Windows-1252 (CP-1252) instead of UTF-8, whether
intentionally or accidentally. Some services even stream subtitles with
malformed or mixed encodings (each segment has a different encoding).

* Remove Subtitle parameter `auto_fix_encoding`

Just always attempt to fix encoding. If the subtitle is neither UTF-8 nor CP-1252, then it should realistically error out instead of producing garbage Subtitle data anyway.

* Move Subtitle encoding fixing code out of the `if drm` tree

* Use chardet as a last-ditch effort to fix Subs, or return the original data

* Move Subtitle.fix_encoding method to utilities as try_ensure_utf8

* Add Shivelight as a contributor

---------

Co-authored-by: rlaphoenix <rlaphoenix@pm.me>
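
The fallback order described in the commit message (UTF-8 first, then CP-1252, then chardet, and finally the original data) can be sketched as follows. This is only an illustration under stated assumptions, not the code from this commit: the name `try_ensure_utf8` comes from the message above, but its signature, the exact order of checks, and the optional import guard are guesses made for the example.

```python
try:
    import chardet  # third-party; the dependency this commit adds
except ImportError:  # assumption: keep the sketch runnable without it
    chardet = None


def try_ensure_utf8(data: bytes) -> bytes:
    """Return `data` re-encoded as UTF-8 if possible, else unchanged.

    Hypothetical signature; the real helper may differ.
    """
    # 1. Already valid UTF-8: hand it back untouched.
    try:
        data.decode("utf8")
        return data
    except UnicodeDecodeError:
        pass

    # 2. Try CP-1252, the common mislabelled encoding named in the message.
    try:
        return data.decode("cp1252").encode("utf8")
    except UnicodeDecodeError:
        pass

    # 3. Last-ditch effort: let chardet guess the encoding.
    if chardet is not None:
        guess = chardet.detect(data)["encoding"]  # may be None
        if guess:
            try:
                return data.decode(guess).encode("utf8")
            except (UnicodeDecodeError, LookupError):
                pass

    # 4. Nothing worked: return the original data.
    return data
```

For example, `try_ensure_utf8(b"caf\xe9")` returns `b"caf\xc3\xa9"`, because the input is not valid UTF-8 but does decode as CP-1252. For streams with mixed encodings, a helper like this would presumably be applied to each segment separately.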

Author: Shivelight
Date: 2023-12-02 19:00:55 +08:00
Committed by: GitHub
Parent: 4b8cfabaac
Commit: c31ee338dc

7 changed files with 58 additions and 6 deletions

									
pyproject.toml (1 change)

@@ -60,6 +60,7 @@ sortedcontainers = "^2.4.0"
 subtitle-filter = "^1.4.6"
 Unidecode = "^1.3.6"
 urllib3 = "^2.0.4"
+chardet = "^5.2.0"

 [tool.poetry.dev-dependencies]
 pre-commit = "^3.4.0"
