├── .dockerignore ├── .flake8 ├── .gitignore ├── Dockerfile ├── README.md ├── app.py ├── config.py ├── crawler ├── __init__.py └── music_crawler.py ├── data └── artist_list.txt ├── format_data.py ├── masks └── romano.png ├── model.py ├── model ├── __init__.py ├── input_pipeline.py ├── rnn.py ├── sample_generator.py ├── song_generator.py └── song_model.py ├── preprocessing ├── __init__.py ├── dataset.py ├── text_preprocessing.py └── tfrecord.py ├── requirements.txt ├── sample.py ├── scripts ├── create_sample.sh ├── create_wordcloud_graph.sh ├── download_musics.sh ├── run_format_data.sh ├── run_model.sh └── run_vagalume_crawler.sh ├── song_word_cloud.py ├── utils ├── __init__.py ├── progress_bar.py └── session_manager.py ├── vagalume_crawler.py └── vagalume_downloader.py /.dockerignore: -------------------------------------------------------------------------------- 1 | .git 2 | .dockerignore 3 | __pycache__/* 4 | scripts/* 5 | crawler/* 6 | data/* 7 | !data/song_dataset/word2index.pkl 8 | !data/song_dataset/index2word.pkl 9 | proibidao/* 10 | !proibidao/song_dataset/word2index.pkl 11 | !proibidao/song_dataset/index2word.pkl 12 | kondzilla/* 13 | !kondzilla/song_dataset/word2index.pkl 14 | !kondzilla/song_dataset/index2word.pkl 15 | ostentacao/* 16 | !ostentacao/song_dataset/word2index.pkl 17 | !ostentacao/song_dataset/index2word.pkl 18 | preprocessing/* 19 | vagalume_crawler.py 20 | vagalume_downloader.py 21 | model.py 22 | sample.py 23 | 24 | -------------------------------------------------------------------------------- /.flake8: -------------------------------------------------------------------------------- 1 | [flake8] 2 | exclude = .git,__pycache__,data/,scripts/ 3 | max-line-length=100 4 | -------------------------------------------------------------------------------- /.gitignore: -------------------------------------------------------------------------------- 1 | *.pyc 2 | data/* 3 | -------------------------------------------------------------------------------- /Dockerfile: -------------------------------------------------------------------------------- 1 | FROM debian:testing 2 | 3 | RUN apt-get update -qy && apt-get install python3-pip -qy 4 | RUN pip3 install tensorflow==1.4.1 flask flask-cors gunicorn 5 | 6 | RUN mkdir -p /funk-generator/ 7 | ADD . /funk-generator/ 8 | WORKDIR /funk-generator/ 9 | 10 | EXPOSE 5000 11 | 12 | CMD ["gunicorn", "-b", "0.0.0.0:5000", "app:app"] 13 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | Funk Generator 2 | =================== 3 | 4 | Decidi criar este projeto quando estava aprendendo sobre como modelos de linguagem usando Deep 5 | Learning funcionam. Escrevi mais detalhadamente sobre este projeto neste 6 | [post](https://medium.com/@lucasmoura_35920/mc-neural-o-funkeiro-artificial-ab6fbedc9771) no Medium. 7 | Além disso, fiz também uma descrição mais técnica do projeto em um 8 | [post](http://lmoura.me/blog/2018/05/07/funk-generator/) no meu blog. Por fim, o programa está 9 | rodando em uma [página](http://lmoura.me/funk_generator/) do meu blog. Lá você pode gerar músicas em 10 | tempo real. 11 | 12 | Neste documento, irei mostrar como fazer para que você consiga rodar o projeto do zero, ou até usar 13 | outras músicas para treinar o seu modelo. 
14 |
15 | Dependências do projeto
16 | ---------
17 |
18 | Antes de tudo, instale as dependências necessárias para executar este projeto:
19 |
20 | ```sh
21 | $ pip install -r requirements.txt
22 | ```
23 |
24 | Coleta das músicas
25 | ----------
26 |
27 | A lista dos artistas que usei pode ser encontrada na pasta *data*, com o nome de *artist_list.txt*.
28 | Esse arquivo é usado para fazer um crawler na API do [vagalume](https://api.vagalume.com.br/).
29 |
30 | Para começar o crawler, execute o seguinte comando:
31 |
32 | ```sh
33 | $ ./scripts/run_vagalume_crawler.sh
34 | ```
35 |
36 | Este script irá passar por todos os artistas da lista e irá criar um diretório para cada artista na
37 | pasta *data*. Dentro de cada diretório, serão criados dois arquivos distintos:
38 |
39 | * *song_codes.txt*: Um arquivo contendo o código de todas as músicas do artista. Esse código é usado
40 | para fazer o download da música em si, usando a API do vagalume.
41 | * *song_names.txt*: Um arquivo contendo o nome das músicas do artista.
42 |
43 | Uma vez que este script tenha sido executado, pode-se então executar o seguinte script:
44 |
45 | ```sh
46 | $ ./scripts/download_musics.sh
47 | ```
48 |
49 | Este script vai entrar em cada diretório que representa um artista e baixar todas as músicas
50 | presentes no *song_codes.txt* daquele diretório. Cada música é armazenada como um arquivo txt.
51 |
52 |
53 | Formatar os dados
54 | ------------------
55 |
56 | Após o download das músicas, é necessário converter os arquivos de texto para um formato que o
57 | modelo entenda. Para isso, rode o seguinte script:
58 |
59 | ```sh
60 | $ ./scripts/run_format_data.sh
61 | ```
62 |
63 | Esse script irá criar uma pasta chamada *song_dataset* dentro da pasta *data*. Dentro dessa pasta,
64 | estarão os arquivos já processados para o treinamento do modelo.
65 |
66 | *OBS: Nesse meu projeto eu criei quatro modelos diferentes para tipos diferentes de funk (Kondzilla,
67 | Proibidão, Ostentação e todas as músicas). Entretanto, essa separação foi feita manualmente. Eu tive
68 | que decidir quais músicas eram só de Ostentação, por exemplo. Para isso, olhei músicas que tinham
69 | certos termos característicos e as agrupei em uma pasta diferente. Ou seja, aqui essa separação não
70 | será feita de forma automática. Se você quiser treinar os 4 modelos como eu fiz, terá que fazer esta
71 | etapa manualmente.*
72 |
73 | Treinamento do modelo
74 | ----------------------
75 |
76 | Uma vez com os dados formatados, basta rodar o seguinte script:
77 |
78 | ```sh
79 | $ ./scripts/run_model.sh
80 | ```
81 |
82 | Após o treinamento ser concluído, será criado um diretório chamado *checkpoint* na raiz do projeto.
83 | Esse diretório contém o modelo treinado. Caso queira continuar treinando esse modelo gerado, altere a
84 | variável *USE_CHECKPOINT* no script *run_model.sh*.
85 |
86 |
87 | Gerar Músicas
88 | --------------
89 |
90 | Para testar o modelo e gerar algumas músicas, rode o seguinte script:
91 |
92 | ```sh
93 | $ ./scripts/create_sample.sh
94 | ```
95 |
96 | Esse script irá gerar uma música por vez.
97 |
98 |
99 | Criar API
100 | ---------------------
101 |
102 | Ao final, rode o seguinte comando para subir a API do projeto:
103 |
104 | ```sh
105 | $ python app.py
106 | ```
107 |
108 | A aplicação rodará com o servidor default do Flask e permite que você teste o modelo por requisições
109 | POST. Lembre-se de que, se você só treinar um modelo, a API só reconhecerá o modelo com id 1. Logo,
110 | lembre-se de sempre setar o id da requisição POST como 1. Por default, o programa será executado na
111 | porta 5000.
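Por exemplo, com o servidor rodando localmente, uma requisição de teste pode ser feita com o *curl* (o valor de *sentence* abaixo é apenas um exemplo de verso inicial; use o texto que quiser):

```sh
$ curl -X POST http://localhost:5000/api \
    -H "Content-Type: application/json" \
    -d '{"id": "1", "sentence": "hoje tem baile"}'
```

A resposta é a letra gerada, retornada como uma string JSON com as quebras de linha já em formato HTML.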
112 |
113 | Além disso, como em produção eu queria que a geração das músicas fosse o mais rápido possível, eu
114 | gerei 4 mil músicas e as armazenei dentro da minha aplicação (essas músicas não estão presentes
115 | neste repositório). Sendo assim, a sua requisição POST não pode ter a variável *sentence* vazia,
116 | pois, caso ela esteja vazia, a aplicação vai tentar pegar aleatoriamente uma das músicas já armazenadas.
117 |
118 | Dessa forma, o modelo só gera músicas em tempo real se o valor da variável *sentence* não for vazio.
119 |
120 | Recomendo que, se você quiser usar esse modelo em produção como eu fiz, use outra aplicação para
121 | fazer o servidor, como o [gunicorn](http://gunicorn.org/).
122 |
123 |
124 | Container
125 | --------------
126 |
127 | Caso você queira apenas usar a aplicação sem criar o modelo do zero, pode usar o container que
128 | criei. Para isso é necessário ter o [Docker](https://www.docker.com/) instalado.
129 |
130 | Uma vez com ele instalado, execute o seguinte comando para baixar a imagem do container:
131 |
132 | ```sh
133 | $ docker pull lucasmoura/funk-generator
134 | ```
135 |
136 | E para executar este container:
137 |
138 | ```sh
139 | $ docker run -d -p 5000:5000 lucasmoura/funk-generator
140 | ```
141 |
142 | Com o container rodando, basta seguir os passos descritos na seção *Criar API* para usar o programa.
143 | Entretanto, o container tem uma vantagem: nele existem todos os 4 modelos de funk que criei e também
144 | nele estão presentes as 4 mil músicas já geradas. Dessa forma, você não está restrito a sempre deixar a
145 | variável *id* como 1, e também pode deixar a variável *sentence* vazia, caso queira recuperar
146 | algumas das músicas já criadas.
147 |
148 | Gere seu próprio modelo com suas próprias músicas
149 | ---------------
150 |
151 | Para gerar seu próprio modelo, basta que você mude o arquivo *artist_list.txt* para conter os
152 | artistas que você quiser e depois é só seguir todos os passos já listados.
153 |
154 | Caso você já tenha as músicas baixadas, garanta que cada artista tem um diretório próprio e que
155 | todas as músicas desse artista estejam no diretório que o representa.
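Por exemplo, a estrutura esperada é parecida com a que o crawler gera (os nomes de artistas e de músicas abaixo são apenas ilustrativos):

```
data/
├── mc-exemplo/
│   ├── primeira-musica.txt
│   └── segunda-musica.txt
└── outro-artista/
    └── alguma-musica.txt
```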
Uma vez isso pronto, basta 156 | continuar à partir da seção *Formatar dados* 157 | -------------------------------------------------------------------------------- /app.py: -------------------------------------------------------------------------------- 1 | import pickle 2 | import random 3 | 4 | from flask import Flask, request, jsonify 5 | from flask_cors import CORS 6 | 7 | from model.sample_generator import create_sample 8 | from config import all_args, kondzilla_args, proibidao_args, ostentacao_args 9 | 10 | 11 | app = Flask(__name__) 12 | CORS(app) 13 | 14 | 15 | # load the model 16 | def load(path): 17 | with open(path, 'rb') as f: 18 | return pickle.load(f) 19 | 20 | 21 | all_songs = load('generated_songs/generated-all-songs.pkl') 22 | kondzilla_songs = load('generated_songs/generated-kondzilla-songs.pkl') 23 | proibidao_songs = load('generated_songs/generated-proibidao-songs.pkl') 24 | ostentacao_songs = load('generated_songs/generated-ostentacao-songs.pkl') 25 | 26 | all_sampler = create_sample(all_args) 27 | kondzilla_sampler = create_sample(kondzilla_args) 28 | proibidao_sampler = create_sample(proibidao_args) 29 | ostentacao_sampler = create_sample(ostentacao_args) 30 | 31 | 32 | def get_song(model_id): 33 | random_num = random.randint(0, len(all_songs) - 1) 34 | 35 | if model_id == 1: 36 | return all_songs[random_num][1:].replace('\n', '
') 37 | elif model_id == 2: 38 | return kondzilla_songs[random_num][1:].replace('\n', '
') 39 | elif model_id == 3: 40 | return proibidao_songs[random_num][1:].replace('\n', '
') 41 | else: 42 | return ostentacao_songs[random_num][1:].replace('\n', '
') 43 | 44 | 45 | # API route 46 | @app.route('/api', methods=['POST']) 47 | def api(): 48 | """API function 49 | 50 | All model-specific logic to be defined in the get_model_api() 51 | function 52 | """ 53 | data = request.json 54 | model_id = int(data['id']) 55 | prime_words = data['sentence'] 56 | 57 | if prime_words == '': 58 | output_data = get_song(model_id) 59 | else: 60 | output_data = -1 61 | if model_id == 1: 62 | while output_data == -1: 63 | output_data = all_sampler(prime_words, html=True) 64 | elif model_id == 2: 65 | while output_data == -1: 66 | output_data = kondzilla_sampler(prime_words, html=True) 67 | elif model_id == 3: 68 | while output_data == -1: 69 | output_data = proibidao_sampler(prime_words, html=True) 70 | else: 71 | while output_data == -1: 72 | output_data = ostentacao_sampler(prime_words, html=True) 73 | 74 | return jsonify(output_data) 75 | 76 | 77 | @app.route('/') 78 | def index(): 79 | return "Index API" 80 | 81 | 82 | # HTTP Errors handlers 83 | @app.errorhandler(404) 84 | def url_error(e): 85 | return """ 86 | Wrong URL! 87 |
{}
""".format(e), 404 88 | 89 | 90 | @app.errorhandler(500) 91 | def server_error(e): 92 | return """ 93 | An internal error occurred:
{}
94 | See logs for full stacktrace. 95 | """.format(e), 500 96 | 97 | 98 | if __name__ == '__main__': 99 | # This is used when running locally. 100 | app.run(host='0.0.0.0', debug=True) 101 | -------------------------------------------------------------------------------- /config.py: -------------------------------------------------------------------------------- 1 | from collections import defaultdict 2 | 3 | 4 | default_args = { 5 | 'use_checkpoint': True, 6 | 'embedding_size': 300, 7 | 'num_layers': 3, 8 | 'num_units': 728 9 | } 10 | 11 | all_args = { 12 | 'checkpoint_path': 'checkpoint', 13 | 'index2word_path': 'data/song_dataset/index2word.pkl', 14 | 'word2index_path': 'data/song_dataset/word2index.pkl', 15 | 'vocab_size': 12551, 16 | } 17 | all_args = {**default_args, **all_args} 18 | all_args = defaultdict(int, all_args) 19 | 20 | kondzilla_args = { 21 | 'checkpoint_path': 'kondzilla_checkpoint', 22 | 'index2word_path': 'kondzilla/song_dataset/index2word.pkl', 23 | 'word2index_path': 'kondzilla/song_dataset/word2index.pkl', 24 | 'vocab_size': 2192, 25 | } 26 | kondzilla_args = {**default_args, **kondzilla_args} 27 | kondzilla_args = defaultdict(int, kondzilla_args) 28 | 29 | proibidao_args = { 30 | 'checkpoint_path': 'proibidao_checkpoint', 31 | 'index2word_path': 'proibidao/song_dataset/index2word.pkl', 32 | 'word2index_path': 'proibidao/song_dataset/word2index.pkl', 33 | 'vocab_size': 1445, 34 | } 35 | proibidao_args = {**default_args, **proibidao_args} 36 | proibidao_args = defaultdict(int, proibidao_args) 37 | 38 | ostentacao_args = { 39 | 'checkpoint_path': 'ostentacao_checkpoint', 40 | 'index2word_path': 'ostentacao/song_dataset/index2word.pkl', 41 | 'word2index_path': 'ostentacao/song_dataset/word2index.pkl', 42 | 'vocab_size': 2035, 43 | } 44 | ostentacao_args = {**default_args, **ostentacao_args} 45 | ostentacao_args = defaultdict(int, ostentacao_args) 46 | -------------------------------------------------------------------------------- /crawler/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lucasmoura/funk_generator/7b3b978d25ea8731cf37e519a41c59116f1fc0c6/crawler/__init__.py -------------------------------------------------------------------------------- /crawler/music_crawler.py: -------------------------------------------------------------------------------- 1 | import os 2 | import time 3 | import requests 4 | import unicodedata 5 | 6 | from bs4 import BeautifulSoup 7 | from pathlib import Path 8 | 9 | 10 | def remove_accented_characters(name): 11 | name = unicodedata.normalize('NFD', name).encode('ascii', 'ignore') 12 | return name.decode('ascii') 13 | 14 | 15 | def clean_name(name): 16 | name = name.strip() 17 | name = name.lower() 18 | name = name.replace(' ', '-') 19 | name = name.replace('/', '') 20 | name = remove_accented_characters(name) 21 | 22 | return name 23 | 24 | 25 | class MusicCrawler: 26 | 27 | def __init__(self, artist_list_path, data_folder): 28 | self.artist_list_path = artist_list_path 29 | self.data_folder = Path(data_folder) 30 | self.artists = None 31 | 32 | self.vagalume_url = 'https://www.vagalume.com.br/' 33 | 34 | def remove_accented_characters(self, artist_name): 35 | artist_name = unicodedata.normalize('NFD', artist_name).encode('ascii', 'ignore') 36 | return artist_name.decode('ascii') 37 | 38 | def parse_artist_name(self, artist_name): 39 | return clean_name(artist_name) 40 | 41 | def load_artists(self): 42 | with open(self.artist_list_path, 'r') as 
artist_file: 43 | artists = artist_file.readlines() 44 | 45 | self.artists = [self.parse_artist_name(artist) for artist in artists] 46 | 47 | def find_tracks_list(self, html_page): 48 | songs = [] 49 | 50 | tracks = html_page.find('ul', {'class': 'tracks'}) 51 | tracks_hrefs = tracks.findAll('a', href=True) 52 | 53 | for track_href in tracks_hrefs: 54 | name = track_href.get_text() 55 | code = track_href.get('data-song') 56 | 57 | if not code or not name: 58 | continue 59 | 60 | songs.append((name, code)) 61 | 62 | return songs 63 | 64 | def get_artist_songs(self, artist_name): 65 | artist_url = self.vagalume_url + artist_name 66 | 67 | response = requests.get(artist_url) 68 | parsed_response = BeautifulSoup(response.content, 'html.parser') 69 | 70 | artist_songs = self.find_tracks_list(parsed_response) 71 | 72 | return artist_songs 73 | 74 | def save_data(self, save_path, save_list): 75 | with save_path.open(mode='w') as f: 76 | for name in save_list: 77 | f.write(name + '\n') 78 | 79 | def save_artist_songs_info(self, artist, artist_songs): 80 | song_names = [name for name, code in artist_songs] 81 | song_codes = [code for name, code in artist_songs] 82 | 83 | save_path = self.data_folder / artist 84 | 85 | if not save_path.exists(): 86 | save_path.mkdir() 87 | 88 | save_path_names = save_path / 'song_names.txt' 89 | save_path_codes = save_path / 'song_codes.txt' 90 | 91 | self.save_data(save_path_names, song_names) 92 | self.save_data(save_path_codes, song_codes) 93 | 94 | def crawl_musics(self): 95 | self.load_artists() 96 | 97 | for artist in self.artists: 98 | print('Getting songs of {}...'.format(artist)) 99 | 100 | artist_songs = self.get_artist_songs(artist) 101 | self.save_artist_songs_info(artist, artist_songs) 102 | 103 | 104 | class MusicDownloader: 105 | 106 | def __init__(self, key_file_path, data_folder, code_file_name): 107 | self.key_file_path = key_file_path 108 | self.data_folder = Path(data_folder) 109 | self.code_file_name = code_file_name 110 | 111 | self.api_url = 'https://api.vagalume.com.br/search.php?musid={}&apikey{}' 112 | 113 | def load_api_key(self): 114 | with open(self.key_file_path, 'r') as key_file: 115 | self.api_key = key_file.read().strip() 116 | 117 | def load_codes(self, codes_path): 118 | with codes_path.open() as code_file: 119 | codes = code_file.readlines() 120 | codes = [code.strip() for code in codes] 121 | 122 | return codes 123 | 124 | def make_request(self, code): 125 | return requests.get(self.api_url.format(code, self.api_key)) 126 | 127 | def save_songs(self, songs, artist_path): 128 | for song_name, song in songs: 129 | song_name += '.txt' 130 | song_path = artist_path / song_name 131 | 132 | with song_path.open(mode='w') as song_file: 133 | song_file.write(song) 134 | 135 | def download_songs(self, artist_name): 136 | codes_path = self.data_folder / artist_name / self.code_file_name 137 | codes = self.load_codes(codes_path) 138 | songs = [] 139 | 140 | for code in codes: 141 | self.make_request(code) 142 | json_response = self.make_request(code) 143 | 144 | song_name, song = self.parse_json_response(json_response) 145 | songs.append((song_name, song)) 146 | time.sleep(3) 147 | 148 | artist_path = self.data_folder / artist_name 149 | self.save_songs(songs, artist_path) 150 | 151 | def clean_song_name(self, song_name): 152 | return clean_name(song_name) 153 | 154 | def parse_json_response(self, json_response): 155 | json_dict = json_response.json() 156 | 157 | song = json_dict['mus'][0]['text'] 158 | song_name = 
json_dict['mus'][0]['name'] 159 | 160 | return self.clean_song_name(song_name), song 161 | 162 | def download_all_songs(self): 163 | self.load_api_key() 164 | 165 | for artist in os.listdir(str(self.data_folder)): 166 | 167 | artist_path = self.data_folder / artist 168 | if not artist_path.is_dir(): 169 | continue 170 | 171 | print('Downloading songs from {}'.format(artist)) 172 | self.download_songs(artist) 173 | -------------------------------------------------------------------------------- /data/artist_list.txt: -------------------------------------------------------------------------------- 1 | 3zi 2 | Allycia 3 | Amaral Mc 4 | Amilcka e Chocolate 5 | Andrezinho Shock 6 | Androw 7 | Anitta 8 | Antunes 9 | Arquiteto do Amor 10 | As Danadinhas 11 | As Experimentas 12 | As Tchutchucas 13 | As Tequileiras do Funk 14 | Babi 15 | Backdi e Bio G3 16 | Backdi e Bio-g3 17 | Banda Aerosol 18 | Berti 19 | Biel 20 | Bob Rum 21 | Bola de Fogo 22 | Bonde Da Madrugada 23 | Bonde Das Gramadas 24 | Bonde Das Maravilhas 25 | Bonde Do Macaco 26 | Bonde Do Ratão 27 | Bonde Life Stronda 28 | Bonde Nervoso 29 | Bonde Neurose 30 | Bonde R300 31 | Bonde Tesão 32 | Bonde da Oskley 33 | Bonde das Dancinhas 34 | Bonde das Impostora 35 | Bonde do Canguru 36 | Bonde do Come Quieto 37 | Bonde do Mainstream 38 | Bonde do Rolê 39 | Bonde do Tigrão 40 | Bonde do Vinho 41 | Bonde dos Perversos 42 | Bonde dos magrinhos 43 | Bruno Moreira 44 | Buchecha 45 | Bó do Catarina 46 | Cabeçudos 47 | Careca e Pixote 48 | Caroline Miranda 49 | Casa de Farinha 50 | Chiquinho e Amaral 51 | Ciclone 52 | Cidinho e Doca 53 | Claudinho 54 | Claudinho Buchecha 55 | Creidi 56 | Cru 57 | Cubanu´s 58 | DJ Cris 59 | DJ Malboro 60 | DJ R7 61 | DJ San e Dieguinho G 62 | DJ Thiago 63 | Danda E Taffarel 64 | Dandara 65 | Dani Brinks 66 | Dani Russo 67 | Danilo e Fabinho 68 | David Bolado 69 | Deize Tigrona 70 | Dennis Dj 71 | Dinho da Vp 72 | Diogo Martins 73 | Dj DANIEL 74 | Dj Dennis 75 | Dj Filé 76 | Dj Marlboro 77 | Dj Tiago 78 | Dr. 
Bacalhau 79 | Dream Team do Passinho 80 | Duda do Marapé 81 | Ed Pupone 82 | Edu Gueda 83 | Edy Lemond 84 | Efeito Contrário 85 | Estação Zero 86 | Fabão Brazil 87 | Fagner Pinheiro 88 | Formiga DJ 89 | Furacão 2000 90 | Fábio Sina 91 | Fábrica da Arte 92 | Gaiola Das Popozudas 93 | Gaab 94 | Gorila e Preto 95 | Grafitte 07 96 | Igor Almeida 97 | Jah Mai 98 | Jaqueline Cindy 99 | Jaula Das Gostosudas 100 | Jerry Smith 101 | Jess 102 | Jherry 103 | Jojo Maronttinni 104 | Jonathan Costa 105 | Juliana e As Fogosas 106 | Junior Launther 107 | Justiceiras Do Funk 108 | Karinna Spencer 109 | Kula 110 | Labanca 111 | Lady Fortunato 112 | Larica Dos Mulekes 113 | Leandro e As Abusadas 114 | Lenny B 115 | Leo Kiss 116 | Leo Sannttos 117 | Lippe 118 | Los Torrones 119 | Louco de Refri 120 | Lucas Angelo 121 | Lucca Venuci 122 | Ludmilla 123 | Malas On Line 124 | Malha Funk 125 | Malibu 126 | Marcinho e Cacau 127 | Max Pierre 128 | Max Rocha 129 | Maíra Brasil 130 | Mc 2k 131 | Mc Ale Soares 132 | Mc Alexandre 133 | Mc Andinho 134 | Mc Andrewzinho 135 | Mc Andrezinho do Complexo 136 | Mc Arthur o Sheik 137 | Mc B o 138 | Mc B.ó 139 | Mc Babi 140 | Mc Bahea 141 | Mc Barriga 142 | Mc Batata 143 | Mc Bella 144 | Mc Bellot 145 | Mc Belzinho 146 | Mc Biel NPF 147 | Mc Bielzinho 148 | Mc Biju 149 | Mc Bin Laden 150 | Mc Biro Leyby 151 | Mc Bobô 152 | Mc Bocarra 153 | Mc Bola 154 | Mc Bolado 155 | Mc Boy do Charmes 156 | Mc Brinquedo 157 | Mc Brisola 158 | Mc Britney 159 | Mc Bruninha 160 | Mc Bruninho 161 | Mc Bruno IP 162 | Mc Bruxo 163 | Mc Buiu 164 | Mc Byana 165 | Mc CL 166 | Mc Cabelinho 167 | Mc Cabide 168 | Mc Cacau 169 | Mc Careca 170 | Mc Carioca 171 | Mc Carol 172 | Mc Caçula 173 | Mc Cebezinho 174 | Mc Cezareth 175 | Mc Chapo 176 | Mc Charles da Alemoa 177 | Mc Chavero 178 | Mc Chicão 179 | Mc Chiquinho 180 | Mc Choko 181 | Mc Clebinho 182 | Mc Colibri 183 | Mc Copinho 184 | Mc Coringa 185 | Mc Coringa Louco 186 | Mc CpK 187 | Mc Crash 188 | Mc Cris 189 | Mc Cristiane 190 | Mc Cruel 191 | Mc Créu Funk 192 | Mc DG 193 | Mc Dada Boladão 194 | Mc Dadinho e Diguinho 195 | Mc Daleste 196 | Mc Danado 197 | Mc Dani 198 | Mc Danilo Boladão 199 | Mc Danilo Zika 200 | Mc Davi 201 | Mc David Bolado 202 | Mc Decão 203 | Mc Dedé 204 | Mc Delano 205 | Mc Delley FD 206 | Mc Dentinho 207 | Mc Denny 208 | Mc Dido 209 | Mc Didô 210 | Mc Dieguinho 211 | Mc Diguinho 212 | Mc Digão 213 | Mc Dimenor Dr 214 | Mc Dingo 215 | Mc Dino 216 | Mc Discolado 217 | Mc Dodo 013 218 | Mc Dodô 219 | Mc Don Juan 220 | Mc Doriva 221 | Mc Douglinhas 222 | Mc Dudu 223 | Mc Duduzinho 224 | Mc Eller 225 | Mc Etiopia 226 | Mc Fabin da VL 227 | Mc Fabuloso 228 | Mc Fael 229 | Mc Falcon 230 | Mc Falcão 231 | Mc Farmá 232 | Mc Federado e os Leleks 233 | Mc Felipe Boladão 234 | Mc Felype 235 | Mc Filhão 236 | Mc Fininho 237 | Mc Fioti 238 | Mc Frank 239 | Mc G15 240 | Mc G3 241 | Mc G7 242 | Mc GB 243 | Mc Gaah e Mc BP 244 | Mc Galo 245 | Mc Gibi 246 | Mc Gil do Andaraí 247 | Mc God 248 | Mc Godô 249 | Mc Gringo 250 | Mc Guga da VG 251 | Mc Gui 252 | Mc Guimê 253 | Mc Guto 254 | Mc Gw 255 | Mc Hariel 256 | Mc Hollywood 257 | Mc Hudson 22 258 | Mc Huguinho 259 | Mc IG 260 | Mc Illana 261 | Mc Islaibe 262 | Mc Italo 263 | Mc J15 264 | Mc JG 265 | Mc Jadson Boladão 266 | Mc Jair da Rocha 267 | Mc Japa 268 | Mc Japa e Mc Japinha 269 | Mc Jean Paul 270 | Mc Jefinho 271 | Mc Jennifer 272 | Mc Jenny 273 | Mc Jerry 274 | Mc Jhey 275 | Mc John Marquês 276 | Mc Johnzinho 277 | Mc Jotta Pê 278 | Mc João 279 | Mc Joãozinho VT 280 | Mc Juninho 281 | Mc 
Juninho Jr 282 | Mc Júnior e Leonardo 283 | Mc K9 284 | Mc Kaká 285 | Mc Kapela 286 | Mc Kapela MK 287 | Mc Karyne da Provi 288 | Mc Katia 289 | Mc Kauan 290 | Mc Keke 291 | Mc Kekel 292 | Mc Kelvin 293 | Mc Kelvinho 294 | Mc Kevin 295 | Mc Kevinho 296 | Mc Kitinho 297 | Mc Koringa 298 | Mc Ks 299 | Mc LB 300 | Mc LBX 301 | Mc Lan 302 | Mc Lano 303 | Mc Lany 304 | Mc Lary Figueiredo 305 | Mc Laís 306 | Mc Leke 307 | Mc Leléto 308 | Mc Leoest 309 | Mc Leozinho 310 | Mc Leozinho do Recife 311 | Mc Lexx 312 | Mc Lipi 313 | Mc Lipivox 314 | Mc Livinho 315 | Mc Loirinha 316 | Mc Loma E As Gêmeas Lacração 317 | Mc Lon 318 | Mc Lone 319 | Mc Luan 320 | Mc Luan 321 | Mc Luciano Sp 322 | Mc Lukaz 323 | Mc Lustosa 324 | Mc Léo da Baixada 325 | Mc Mac Air 326 | Mc Magal 327 | Mc Magrinho 328 | Mc Maha 329 | Mc Maiquinho 330 | Mc Mallone 331 | Mc Maneirinho 332 | Mc Marcelly 333 | Mc Marcinho 334 | Mc Marcio Braz 335 | Mc Marcio G 336 | Mc Marina 337 | Mc Marks 338 | Mc Marlin 339 | Mc Maromba 340 | Mc Martinho 341 | Mc Mascote 342 | Mc Matheuzinho DK 343 | Mc Max 344 | Mc Mazinho 345 | Mc Melqui 346 | Mc Menassi 347 | Mc Menor 348 | Mc Menor da VG 349 | Mc Menorzinha 350 | Mc Menorzão 351 | Mc Metal e Cego 352 | Mc Milk 353 | Mc Mingau 354 | Mc Mirella 355 | Mc Misa 356 | Mc Ml 357 | Mc Mm 358 | Mc Moreno 359 | Mc Mágico 360 | Mc Márcio G 361 | Mc Mãozinha 362 | Mc Naldinho 363 | Mc Nandinho 364 | Mc Nany 365 | Mc Natacha 366 | Mc Nayara 367 | Mc Nego Bam 368 | Mc Nego Blue 369 | Mc Neguinho 370 | Mc Negão do Arizona 371 | Mc Neguinho do Kaxeta 372 | Mc Nem 373 | Mc Neném 374 | Mc Nice 375 | Mc Nobruh 376 | Mc North 377 | Mc Novinha 378 | Mc Novinho 379 | Mc Ombrinho 380 | Mc Orelha 381 | Mc PH 382 | Mc PP Da VS 383 | Mc PR 384 | Mc Pack Original 385 | Mc Papo 386 | Mc Patoroko 387 | Mc Patrick MF 388 | Mc Pedrinho 389 | Mc Pedrinho Jr 390 | Mc Pedrinho e Mc Léo da Baixada 391 | Mc Pekeno 392 | Mc Pelé 393 | Mc Pereira 394 | Mc Pet 395 | Mc Petter 396 | Mc Phe Cachorrera 397 | Mc Pierre 398 | Mc Pikachu 399 | Mc PikenaSK 400 | Mc Pingo 401 | Mc Pirata 402 | Mc Pivete 403 | Mc Pocahontas 404 | Mc Poiaka 405 | Mc Primo 406 | Mc Princesa e Plebeu 407 | Mc Pé de Pano 408 | Mc R1 409 | Mc Rael 410 | Mc Rael Souza 411 | Mc Rafa 412 | Mc Renan 413 | Mc Ricardinho 414 | Mc Ricardo 415 | Mc Richesse 416 | Mc Rick Lima 417 | Mc Rihanna da Baixada 418 | Mc Rita 419 | Mc Roba Cena 420 | Mc Robertinho 421 | Mc Robinho de Prata 422 | Mc Robs 423 | Mc Rodolfinho 424 | Mc Rodrigão 425 | Mc Rojai 426 | Mc Romeu 427 | Mc Ronny Khalifa 428 | Mc Rose 429 | Mc Rubby 430 | Mc Ruzika 431 | Mc Sabrina 432 | Mc Sabãozinho 433 | Mc Saed 434 | Mc Samuka e Nego 435 | Mc Sapão 436 | Mc Sargento 437 | Mc Savinon 438 | Mc Serginho 439 | Mc Serginho da VS 440 | Mc Sluck 441 | Mc Smith 442 | Mc Suave 443 | Mc Suellen 444 | Mc Sunda 445 | Mc Suzy 446 | Mc TH 447 | Mc TL 448 | Mc Tarapi 449 | Mc Tartaruga 450 | Mc Tati Zaqui 451 | Mc Tavinho JBK 452 | Mc Taz 453 | Mc Tchesko 454 | Mc Teco e Buzunga 455 | Mc Tevez 456 | Mc Thesko 457 | Mc Tiki 458 | Mc Tikão 459 | Mc Timbu 460 | Mc Tiozinho 461 | Mc Tom 462 | Mc Troia 463 | Mc Tupan 464 | Mc Uchoa 465 | Mc Vareta 466 | Mc Vine 467 | Mc Viné 468 | Mc Vitinho 469 | Mc Vitinho 2 470 | Mc Vitão 471 | Mc Vitêra 472 | Mc Vuk Vuk 473 | Mc Wc 474 | Mc Wendy 475 | Mc William-SP 476 | Mc Wm 477 | Mc Xlep 478 | Mc Yago 479 | Mc Yuri BH 480 | Mc Yuri BH 481 | Mc Zaac 482 | Mc Zedek 483 | Mc Zoi de Gato 484 | Mc Zuka 485 | Mc aw 486 | Mc kill 487 | McViktor 488 | Mcs BW 489 | Mcs Deco e Luco 490 | 
Mcs Gêmeos 491 | Mcs Jhowzinho e Kadinho 492 | Mcs Magrelo e Nenê 493 | Mcs Nenem e Magrão 494 | Mcs Samuka e Nego 495 | Mcs Zaac e Jerry 496 | Medrado 497 | Meik Of 498 | Menor 499 | Menor da Provi 500 | Menor do Chapa 501 | Menor do Chapa 502 | Mensageiros da Favela 503 | Michel Grasiani 504 | Mike 505 | Milennium Cia Show 506 | Miryan Martin 507 | Mr Catra 508 | Mr Poll 509 | Mr Pézão 510 | Mr. Fia 511 | Mr. Jamaica 512 | Mr. Mu 513 | Mulher Filé 514 | Mulher Gato 515 | Mulher Melancia 516 | Mulher Melão 517 | Mulher Moranguinho 518 | Márcio E Goró 519 | Márcio do Cacuia 520 | Mó-H 521 | Naldo 522 | Nanda Black 523 | Nanda Lynn 524 | Nanda Lyra 525 | Nathy Souto 526 | Nayara 527 | Nego do Borel 528 | Neguinho Do Caxeta 529 | Neguinho do Caxeta 530 | Negão do Arizona 531 | Nélio e Espiga 532 | Olliver 533 | Os Atraentes 534 | Os Atrevidos 535 | Os Avassaladores 536 | Os Bad Boys Funk 537 | Os Carrascos 538 | Os Caçadores 539 | Os Cretinos 540 | Os Danadinhos 541 | Os Don Juan 542 | Os Hawaianos 543 | Os Magrinhos 544 | Os Nerds 545 | Os Novinhos 546 | Os Ousados 547 | Os Polêmicos 548 | Os Sem Noção 549 | Os Vadios 550 | Os poETs 551 | Oz Muleke’s 552 | Oz Predadorez 553 | PH Lima 554 | Pablo Renato 555 | Pancadão do Caldeirão do Huck 556 | Paranga 557 | Perlla 558 | Pikeno 559 | Pikeno e Menor 560 | Priscila Nocetti 561 | Prud Rey 562 | Rafael Vidalles 563 | Raimundo Soldado 564 | Raphaella 565 | Renatinho e Alemão 566 | Robinho da Prata 567 | Rocket Pocket 568 | Rodney Dy 569 | Sabrina Boing Boing 570 | San Danado 571 | Saulo Matos 572 | Sd Boys 573 | Sharon Axé Moi 574 | Silvio e Robson 575 | Sol do Recanto 576 | Stronda 4 life 577 | Suspeita 578 | Tathi Kiss 579 | Tati Quebra Barraco 580 | Tiago Miller 581 | Valentes do funk 582 | Valesca Popozuda 583 | Vanessinha Pikatchu 584 | Verônica Costa 585 | Vine Rodry 586 | Vinicius e Andinho 587 | Vinny e Will 588 | -------------------------------------------------------------------------------- /format_data.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | 3 | from pathlib import Path 4 | 5 | from preprocessing.dataset import MusicDataset 6 | from preprocessing.tfrecord import SentenceTFRecord 7 | from preprocessing.text_preprocessing import (get_vocabulary, create_word_dictionaties, 8 | save, replace_unk_words, replace_words_with_ids, 9 | create_labels, get_sizes_list, create_chunks) 10 | 11 | 12 | DATA = 0 13 | LABELS = 1 14 | SIZES = 2 15 | 16 | 17 | def save_dataset(dataset_all, dataset_type, dataset_save_path): 18 | save_path = dataset_save_path / dataset_type 19 | 20 | if not save_path.is_dir(): 21 | save_path.mkdir() 22 | 23 | data = dataset_all[DATA] 24 | data_save_path = save_path / str(dataset_type + '_data.pkl') 25 | save(data, data_save_path) 26 | 27 | labels = dataset_all[LABELS] 28 | labels_save_path = save_path / str(dataset_type + '_labels.pkl') 29 | save(labels, labels_save_path) 30 | 31 | sizes = dataset_all[SIZES] 32 | sizes_save_path = save_path / str(dataset_type + '_sizes.pkl') 33 | save(sizes, sizes_save_path) 34 | 35 | 36 | def create_tfrecord(dataset, dataset_type, dataset_save_path): 37 | save_path = dataset_save_path / dataset_type / str(dataset_type + '.tfrecord') 38 | sentence_tfrecord = SentenceTFRecord(dataset, str(save_path)) 39 | sentence_tfrecord.parse_sentences() 40 | 41 | 42 | def full_preprocessing(train, validation, test, data_folder, 43 | dataset_save_path, min_frequency): 44 | 45 | data_folder = Path(data_folder) 46 | 47 | print('Creating 
vocabulary ...') 48 | vocabulary = get_vocabulary(train, min_frequency) 49 | print('Vocabulary lenght: {}'.format(len(vocabulary))) 50 | word2index, index2word = create_word_dictionaties(vocabulary) 51 | 52 | index2word_path = data_folder / dataset_save_path / 'index2word.pkl' 53 | save(index2word, index2word_path) 54 | word2index_path = data_folder / dataset_save_path / 'word2index.pkl' 55 | save(word2index, word2index_path) 56 | 57 | print('Replacing unknown words ...') 58 | replace_unk_words(train, word2index) 59 | replace_unk_words(validation, word2index) 60 | replace_unk_words(test, word2index) 61 | 62 | print('Turning words into word ids ...') 63 | train = replace_words_with_ids(train, word2index) 64 | validation = replace_words_with_ids(validation, word2index) 65 | test = replace_words_with_ids(test, word2index) 66 | 67 | print('Creating chunks ...') 68 | train = create_chunks(train, chunk_max_size=35) 69 | validation = create_chunks(validation, chunk_max_size=35) 70 | test = create_chunks(test, chunk_max_size=35) 71 | 72 | print('Creating labels ...') 73 | train, train_labels = create_labels(train) 74 | validation, validation_labels = create_labels(validation) 75 | test, test_labels = create_labels(test) 76 | 77 | print('Creating size list ...') 78 | train_sizes = get_sizes_list(train) 79 | validation_sizes = get_sizes_list(validation) 80 | test_sizes = get_sizes_list(test) 81 | 82 | train_all = (train, train_labels, train_sizes) 83 | validation_all = (validation, validation_labels, validation_sizes) 84 | test_all = (test, test_labels, test_sizes) 85 | 86 | dataset_save_path = data_folder / dataset_save_path 87 | 88 | print('Saving datasets ...') 89 | save_dataset(train_all, 'train', dataset_save_path) 90 | save_dataset(validation_all, 'validation', dataset_save_path) 91 | save_dataset(test_all, 'test', dataset_save_path) 92 | 93 | print('Creating tfrecords ...') 94 | create_tfrecord(train_all, 'train', dataset_save_path) 95 | create_tfrecord(validation_all, 'validation', dataset_save_path) 96 | create_tfrecord(test_all, 'test', dataset_save_path) 97 | 98 | 99 | def create_argparse(): 100 | parser = argparse.ArgumentParser() 101 | 102 | parser.add_argument('-df', 103 | '--data-folder', 104 | type=str, 105 | help='Location of the songs files') 106 | 107 | parser.add_argument('-dsp', 108 | '--dataset-save-path', 109 | type=str, 110 | help='Location to save the dataset files') 111 | 112 | parser.add_argument('-mf', 113 | '--min-frequency', 114 | type=int, 115 | help='Minimum word frequency required for a word to a part of the vocabulary') # noqa 116 | 117 | parser.add_argument('-vp', 118 | '--validation-percent', 119 | type=float, 120 | help='Percent of train dataset to use for validation') 121 | 122 | parser.add_argument('-tp', 123 | '--test-percent', 124 | type=float, 125 | help='Percent of train dataset to use for test') 126 | 127 | return parser 128 | 129 | 130 | def main(): 131 | parser = create_argparse() 132 | user_args = vars(parser.parse_args()) 133 | 134 | data_folder = user_args['data_folder'] 135 | dataset_save_path = Path(user_args['dataset_save_path']) 136 | music_dataset = MusicDataset(data_folder, dataset_save_path) 137 | 138 | validation_percent = user_args['validation_percent'] 139 | test_percent = user_args['test_percent'] 140 | music_dataset.create_dataset( 141 | validation_percent=validation_percent, test_percent=test_percent) 142 | music_dataset.display_info() 143 | 144 | train_dataset = music_dataset.train_dataset 145 | validation_dataset = 
music_dataset.validation_dataset 146 | test_dataset = music_dataset.test_dataset 147 | min_frequency = user_args['min_frequency'] 148 | 149 | full_preprocessing( 150 | train=train_dataset, 151 | validation=validation_dataset, 152 | test=test_dataset, 153 | data_folder=data_folder, 154 | dataset_save_path=dataset_save_path, 155 | min_frequency=min_frequency) 156 | 157 | 158 | if __name__ == '__main__': 159 | main() 160 | -------------------------------------------------------------------------------- /masks/romano.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lucasmoura/funk_generator/7b3b978d25ea8731cf37e519a41c59116f1fc0c6/masks/romano.png -------------------------------------------------------------------------------- /model.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import os 3 | 4 | from model.input_pipeline import InputPipeline 5 | from model.rnn import RecurrentModel, RecurrentConfig 6 | from model.song_generator import GreedySongGenerator 7 | from utils.session_manager import initialize_session 8 | 9 | 10 | os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3' 11 | 12 | 13 | def create_argparse(): 14 | argument_parser = argparse.ArgumentParser() 15 | 16 | argument_parser.add_argument('-tf', 17 | '--train-file', 18 | type=str, 19 | help='Location of the training file') 20 | 21 | argument_parser.add_argument('-vf', 22 | '--validation-file', 23 | type=str, 24 | help='Location of the validation file') 25 | 26 | argument_parser.add_argument('-tef', 27 | '--test-file', 28 | type=str, 29 | help='Location of the test file') 30 | 31 | argument_parser.add_argument('-chp', 32 | '--checkpoint-path', 33 | type=str, 34 | help="The path to save model's checkpoint") 35 | 36 | argument_parser.add_argument('-uch', 37 | '--use-checkpoint', 38 | type=int, 39 | help='If the model checkpoint should be loaded') 40 | 41 | argument_parser.add_argument('-i2w', 42 | '--index2word-path', 43 | type=str, 44 | help='Location of the index2word dict') 45 | 46 | argument_parser.add_argument('-w2i', 47 | '--word2index-path', 48 | type=str, 49 | help='Location of word2index dict') 50 | 51 | argument_parser.add_argument('-ne', 52 | '--num-epochs', 53 | type=int, 54 | help='Number of epochs to train') 55 | 56 | argument_parser.add_argument('-bs', 57 | '--batch-size', 58 | type=int, 59 | help='Batch size to use in the model') 60 | 61 | argument_parser.add_argument('-lr', 62 | '--learning-rate', 63 | type=float, 64 | help='Learning rate to use when training') 65 | 66 | argument_parser.add_argument('-nl', 67 | '--num-layers', 68 | type=int, 69 | help='Number of lstm layers to use') 70 | 71 | argument_parser.add_argument('-nu', 72 | '--num-units', 73 | type=int, 74 | help='Number of units to use in the lstm cell') 75 | 76 | argument_parser.add_argument('-vs', 77 | '--vocab-size', 78 | type=int, 79 | help='Size of the vocabulary') 80 | 81 | argument_parser.add_argument('-es', 82 | '--embedding-size', 83 | type=int, 84 | help='Dimension of the embedding matrix') 85 | 86 | argument_parser.add_argument('-ed', 87 | '--embedding-dropout', 88 | type=float, 89 | help='Embedding dropout') 90 | 91 | argument_parser.add_argument('-lod', 92 | '--lstm-output-dropout', 93 | type=float, 94 | help='LSTM output dropout') 95 | 96 | argument_parser.add_argument('-lid', 97 | '--lstm-input-dropout', 98 | type=float, 99 | help='LSTM input dropout') 100 | 101 | argument_parser.add_argument('-lsd', 102 | 
'--lstm-state-dropout', 103 | type=float, 104 | help='LSTM state dropout') 105 | 106 | argument_parser.add_argument('-wd', 107 | '--weight-decay', 108 | type=float, 109 | help='Weight decay') 110 | 111 | argument_parser.add_argument('-minv', 112 | '--min-val', 113 | type=int, 114 | help='Min value to use when initializing weights') 115 | 116 | argument_parser.add_argument('-maxv', 117 | '--max-val', 118 | type=int, 119 | help='Max value to use when initializing weights') 120 | 121 | argument_parser.add_argument('-nbc', 122 | '--num-buckets', 123 | type=int, 124 | help='Number of buckets to use') 125 | 126 | argument_parser.add_argument('-bcw', 127 | '--bucket-width', 128 | type=int, 129 | help='Number of elements allowed in bucket') 130 | 131 | argument_parser.add_argument('-pb', 132 | '--prefetch-buffer', 133 | type=int, 134 | help='Size of prefetch buffer') 135 | 136 | argument_parser.add_argument('-pf', 137 | '--perform-shuffle', 138 | type=int, 139 | help='If we shoudl shuffle the batches when training the model') 140 | 141 | return argument_parser 142 | 143 | 144 | def main(): 145 | argument_parser = create_argparse() 146 | user_args = vars(argument_parser.parse_args()) 147 | 148 | train_file = user_args['train_file'] 149 | validation_file = user_args['validation_file'] 150 | test_file = user_args['test_file'] 151 | batch_size = user_args['batch_size'] 152 | num_buckets = user_args['num_buckets'] 153 | bucket_width = user_args['bucket_width'] 154 | prefetch_buffer = user_args['prefetch_buffer'] 155 | perform_shuffle = True if user_args['perform_shuffle'] == 1 else False 156 | 157 | dataset = InputPipeline( 158 | train_files=train_file, 159 | validation_files=validation_file, 160 | test_files=test_file, 161 | batch_size=batch_size, 162 | perform_shuffle=perform_shuffle, 163 | bucket_width=bucket_width, 164 | num_buckets=num_buckets, 165 | prefetch_buffer=prefetch_buffer) 166 | 167 | dataset.build_pipeline() 168 | 169 | config = RecurrentConfig(user_args) 170 | model = RecurrentModel(dataset, config) 171 | model.build_graph() 172 | 173 | with initialize_session(config) as (sess, saver): 174 | model.fit(sess, saver) 175 | 176 | generator = GreedySongGenerator(model) 177 | print('Generating song (Greedy) ...') 178 | generator.generate(sess) 179 | 180 | 181 | if __name__ == '__main__': 182 | main() 183 | -------------------------------------------------------------------------------- /model/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lucasmoura/funk_generator/7b3b978d25ea8731cf37e519a41c59116f1fc0c6/model/__init__.py -------------------------------------------------------------------------------- /model/input_pipeline.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | 3 | 4 | class SongDataset: 5 | 6 | def __init__(self, data, batch_size, perform_shuffle, 7 | bucket_width, num_buckets, prefetch_buffer): 8 | self.data = data 9 | self.batch_size = batch_size 10 | self.perform_shuffle = perform_shuffle 11 | self.bucket_width = bucket_width 12 | self.num_buckets = num_buckets 13 | self.prefetch_buffer = prefetch_buffer 14 | 15 | def parser(self, tfrecord): 16 | context_features = { 17 | 'size': tf.FixedLenFeature([], dtype=tf.int64), 18 | } 19 | 20 | sequence_features = { 21 | 'tokens': tf.FixedLenSequenceFeature([], dtype=tf.int64), 22 | 'labels': tf.FixedLenSequenceFeature([], dtype=tf.int64) 23 | } 24 | 25 | tfrecord_parsed = 
tf.parse_single_sequence_example( 26 | tfrecord, context_features, sequence_features) 27 | 28 | tokens = tfrecord_parsed[1]['tokens'] 29 | labels = tfrecord_parsed[1]['labels'] 30 | size = tfrecord_parsed[0]['size'] 31 | 32 | return tokens, labels, size 33 | 34 | def init_dataset(self): 35 | song_dataset = tf.data.TFRecordDataset(self.data) 36 | song_dataset = song_dataset.cache() 37 | song_dataset = song_dataset.map(self.parser, num_parallel_calls=8) 38 | 39 | return song_dataset 40 | 41 | def create_dataset(self): 42 | song_dataset = self.init_dataset() 43 | 44 | if self.perform_shuffle: 45 | song_dataset = song_dataset.shuffle(buffer_size=self.batch_size * 2) 46 | 47 | def batching_func(dataset): 48 | return dataset.padded_batch( 49 | self.batch_size, 50 | padded_shapes=( 51 | tf.TensorShape([None]), # token 52 | tf.TensorShape([None]), # label 53 | tf.TensorShape([])) # size 54 | ) 55 | 56 | def key_func(tokens, labels, size): 57 | bucket_id = size // self.bucket_width 58 | 59 | return tf.to_int64(tf.minimum(bucket_id, self.num_buckets)) 60 | 61 | def reduce_func(bucket_key, widowed_data): 62 | return batching_func(widowed_data) 63 | 64 | song_dataset = song_dataset.apply( 65 | tf.contrib.data.group_by_window( 66 | key_func=key_func, reduce_func=reduce_func, window_size=self.batch_size)) 67 | 68 | self.song_dataset = song_dataset.prefetch(self.prefetch_buffer) 69 | 70 | return self.song_dataset 71 | 72 | 73 | class InputPipeline: 74 | 75 | def __init__(self, train_files, validation_files, test_files, batch_size, perform_shuffle, 76 | bucket_width, num_buckets, prefetch_buffer): 77 | self.train_files = train_files 78 | self.validation_files = validation_files 79 | self.test_files = test_files 80 | self.batch_size = batch_size 81 | self.perform_shuffle = perform_shuffle 82 | self.bucket_width = bucket_width 83 | self.num_buckets = num_buckets 84 | self.prefetch_buffer = prefetch_buffer 85 | 86 | self._train_iterator_op = None 87 | self._validation_iterator_op = None 88 | self._test_iterator_op = None 89 | 90 | @property 91 | def train_iterator(self): 92 | return self._train_iterator_op 93 | 94 | @property 95 | def validation_iterator(self): 96 | return self._validation_iterator_op 97 | 98 | @property 99 | def test_iterator(self): 100 | return self._test_iterator_op 101 | 102 | def create_datasets(self, dataset=SongDataset): 103 | train_dataset = dataset( 104 | self.train_files, self.batch_size, self.perform_shuffle, 105 | self.bucket_width, self.num_buckets, self.prefetch_buffer) 106 | validation_dataset = dataset( 107 | self.validation_files, self.batch_size, False, 108 | self.bucket_width, self.num_buckets, self.prefetch_buffer) 109 | test_dataset = dataset( 110 | self.test_files, self.batch_size, False, 111 | self.bucket_width, self.num_buckets, self.prefetch_buffer) 112 | 113 | self.train_dataset = train_dataset.create_dataset() 114 | self.validation_dataset = validation_dataset.create_dataset() 115 | self.test_dataset = test_dataset.create_dataset() 116 | 117 | def create_iterator(self): 118 | self._train_iterator_op = self.train_dataset.make_initializable_iterator() 119 | self._validation_iterator_op = self.validation_dataset.make_initializable_iterator() 120 | self._test_iterator_op = self.test_dataset.make_initializable_iterator() 121 | 122 | def get_num_batches(self, iterator): 123 | with tf.Session() as sess: 124 | num_batches = 0 125 | sess.run(iterator.initializer) 126 | 127 | while True: 128 | try: 129 | _, _, _ = sess.run(iterator.get_next()) 130 | num_batches += 1 131 | 
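# Note: tf.data iterators raise tf.errors.OutOfRangeError when the dataset is exhausted; it is caught below to stop counting batches.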
except tf.errors.OutOfRangeError: 132 | break 133 | 134 | return num_batches 135 | 136 | def get_datasets_num_batches(self): 137 | self.train_batches = self.get_num_batches(self.train_iterator) 138 | self.validation_batches = self.get_num_batches(self.validation_iterator) 139 | self.test_batches = self.get_num_batches(self.test_iterator) 140 | 141 | def build_pipeline(self): 142 | self.create_datasets() 143 | self.create_iterator() 144 | -------------------------------------------------------------------------------- /model/rnn.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | 3 | 4 | from model.song_model import ModelConfig, SongLyricsModel 5 | 6 | 7 | class RecurrentConfig(ModelConfig): 8 | 9 | def __init__(self, model_params): 10 | super().__init__(model_params) 11 | 12 | self.num_layers = model_params['num_layers'] 13 | self.num_units = model_params['num_units'] 14 | self.embedding_dropout = model_params['embedding_dropout'] 15 | self.lstm_state_dropout = model_params['lstm_state_dropout'] 16 | self.lstm_input_dropout = model_params['lstm_input_dropout'] 17 | self.lstm_output_dropout = model_params['lstm_output_dropout'] 18 | self.weight_decay = model_params['weight_decay'] 19 | self.min_val = model_params['min_val'] 20 | self.max_val = model_params['max_val'] 21 | 22 | 23 | class RecurrentModel(SongLyricsModel): 24 | 25 | def add_placeholders_op(self): 26 | self.embedding_dropout_placeholder = tf.placeholder( 27 | tf.float32, name='embedding_dropout') 28 | self.lstm_state_dropout_placeholder = tf.placeholder( 29 | tf.float32, name='lstm_state_dropout') 30 | self.lstm_input_dropout_placeholder = tf.placeholder( 31 | tf.float32, name='lstm_input_dropout') 32 | self.lstm_output_dropout_placeholder = tf.placeholder( 33 | tf.float32, name='lstm_output_dropout') 34 | 35 | def create_train_feed_dict(self): 36 | feed_dict = { 37 | self.embedding_dropout_placeholder: self.config.embedding_dropout, 38 | self.lstm_state_dropout_placeholder: self.config.lstm_state_dropout, 39 | self.lstm_input_dropout_placeholder: self.config.lstm_input_dropout, 40 | self.lstm_output_dropout_placeholder: self.config.lstm_output_dropout, 41 | } 42 | 43 | return feed_dict 44 | 45 | def create_validation_feed_dict(self): 46 | feed_dict = { 47 | self.embedding_dropout_placeholder: 1.0, 48 | self.lstm_state_dropout_placeholder: 1.0, 49 | self.lstm_input_dropout_placeholder: 1.0, 50 | self.lstm_output_dropout_placeholder: 1.0, 51 | } 52 | 53 | return feed_dict 54 | 55 | def create_generate_feed_dict(self, data, temperature, state): 56 | feed_dict = self.create_validation_feed_dict() 57 | 58 | feed_dict[self.data_placeholder] = data 59 | feed_dict[self.temperature_placeholder] = temperature 60 | feed_dict[self.initial_state] = state 61 | 62 | return feed_dict 63 | 64 | def add_embeddings_op(self, data_batch): 65 | with tf.name_scope('embeddings'): 66 | self.embeddings = tf.get_variable( 67 | 'embeddings', 68 | initializer=tf.random_uniform_initializer( 69 | minval=self.config.min_val, maxval=self.config.max_val), 70 | shape=(self.config.vocab_size, self.config.embedding_size), 71 | dtype=tf.float32 72 | ) 73 | 74 | self.embeddings_dropout = tf.nn.dropout( 75 | self.embeddings, keep_prob=self.embedding_dropout_placeholder) 76 | 77 | inputs = tf.nn.embedding_lookup( 78 | self.embeddings_dropout, data_batch) 79 | 80 | return inputs 81 | 82 | def add_logits_op(self, data_batch, size_batch, reuse=False): 83 | with tf.variable_scope('logits', reuse=reuse): 84 
| data_embeddings = self.add_embeddings_op(data_batch) 85 | 86 | with tf.name_scope('recurrent_layer'): 87 | def make_cell(input_size): 88 | lstm_cell = tf.nn.rnn_cell.LSTMCell( 89 | self.config.num_units) 90 | drop_cell = tf.nn.rnn_cell.DropoutWrapper( 91 | lstm_cell, 92 | state_keep_prob=self.lstm_state_dropout_placeholder, 93 | output_keep_prob=self.lstm_output_dropout_placeholder, 94 | variational_recurrent=True, 95 | input_size=input_size, 96 | dtype=tf.float32) 97 | 98 | return drop_cell 99 | 100 | input_sizes = [ 101 | self.config.embedding_size, self.config.num_units, self.config.num_units 102 | ] 103 | self.cell = tf.nn.rnn_cell.MultiRNNCell( 104 | [make_cell(input_sizes[i]) for i in range(self.config.num_layers)]) 105 | 106 | self.initial_state = self.cell.zero_state( 107 | tf.shape(data_batch)[0], tf.float32) 108 | 109 | outputs, final_state = tf.nn.dynamic_rnn( 110 | self.cell, 111 | data_embeddings, 112 | sequence_length=size_batch, 113 | initial_state=self.initial_state, 114 | dtype=tf.float32 115 | ) 116 | 117 | with tf.name_scope('logits'): 118 | flat_outputs = tf.reshape(outputs, [-1, self.config.num_units]) 119 | 120 | weights = tf.get_variable( 121 | 'weights', 122 | initializer=tf.contrib.layers.xavier_initializer(), 123 | shape=(self.config.num_units, self.config.embedding_size), 124 | dtype=tf.float32) 125 | 126 | bias = tf.get_variable( 127 | 'bias', 128 | initializer=tf.ones_initializer(), 129 | shape=(self.config.embedding_size), 130 | dtype=tf.float32) 131 | 132 | flat_inputs = tf.matmul( 133 | flat_outputs, weights) + bias 134 | 135 | bias_logits = tf.get_variable( 136 | 'bias_logits', 137 | initializer=tf.ones_initializer(), 138 | shape=(self.config.vocab_size), 139 | dtype=tf.float32) 140 | 141 | flat_logits = tf.matmul( 142 | flat_inputs, tf.transpose(self.embeddings)) + bias_logits 143 | 144 | batch_size = tf.shape(data_batch)[0] 145 | max_len = tf.shape(data_batch)[1] 146 | 147 | logits = tf.reshape( 148 | flat_logits, [batch_size, max_len, self.config.vocab_size]) 149 | 150 | return logits, final_state 151 | 152 | def add_loss_op(self, logits, labels_batch, size_batch): 153 | with tf.name_scope('loss'): 154 | weights = tf.sequence_mask(size_batch, dtype=tf.float32) 155 | 156 | seq_loss = tf.contrib.seq2seq.sequence_loss( 157 | logits=logits, 158 | targets=labels_batch, 159 | weights=weights 160 | ) 161 | 162 | loss = tf.reduce_sum(seq_loss) 163 | 164 | return loss 165 | 166 | def add_l2_regularizer_op(self, loss): 167 | l2_loss = self.config.weight_decay * tf.add_n( 168 | [tf.nn.l2_loss(v) for v in tf.trainable_variables()]) 169 | 170 | return loss + l2_loss 171 | 172 | def add_train_op(self, loss): 173 | optimizer = tf.train.AdamOptimizer(learning_rate=self.config.learning_rate) 174 | optimizer_op = optimizer.minimize(loss) 175 | 176 | return optimizer_op 177 | -------------------------------------------------------------------------------- /model/sample_generator.py: -------------------------------------------------------------------------------- 1 | from model.rnn import RecurrentModel, RecurrentConfig 2 | from model.song_generator import GreedySongGenerator 3 | 4 | import tensorflow as tf 5 | 6 | 7 | def create_sample(model_config, html=False): 8 | sample_config = RecurrentConfig(model_config) 9 | graph = tf.Graph() 10 | 11 | with graph.as_default(): 12 | model = RecurrentModel(None, sample_config) 13 | model.build_placeholders() 14 | model.build_generate_graph(reuse=False) 15 | 16 | config = tf.ConfigProto(device_count={'GPU': 0}) 17 | sess = 
tf.Session(config=config) 18 | 19 | checkpoint = tf.train.latest_checkpoint(sample_config.checkpoint_path) 20 | saver = tf.train.Saver() 21 | saver.restore(sess, checkpoint) 22 | generator = GreedySongGenerator(model) 23 | 24 | def sample(prime_words, html): 25 | prime_words = prime_words.split() 26 | with graph.as_default(): 27 | return generator.generate(sess, prime_words=prime_words, html=html) 28 | 29 | return sample 30 | -------------------------------------------------------------------------------- /model/song_generator.py: -------------------------------------------------------------------------------- 1 | import numpy as np 2 | import tensorflow as tf 3 | 4 | 5 | class GreedySongGenerator: 6 | 7 | def __init__(self, model): 8 | self.model = model 9 | 10 | def parse_song(self, song_list, html): 11 | parsed_song = [] 12 | is_mc = False 13 | 14 | for index in range(len(song_list)): 15 | curr_word = song_list[index] 16 | 17 | if index == 0: 18 | parsed_song.append(curr_word) 19 | continue 20 | 21 | if curr_word[0].isupper(): 22 | if not html: 23 | parsed_song.append('\n') 24 | else: 25 | parsed_song.append('
') 26 | 27 | if is_mc: 28 | parsed_song.append('Neural') 29 | is_mc = False 30 | else: 31 | parsed_song.append(curr_word) 32 | 33 | if curr_word.lower() == 'mc': 34 | is_mc = True 35 | 36 | return ' '.join(parsed_song) 37 | 38 | def weighted_pick(self, weights): 39 | t = np.cumsum(weights) 40 | s = np.sum(weights) 41 | return(int(np.searchsorted(t, np.random.rand(1)*s))) 42 | 43 | def create_initial_state(self, sess): 44 | state = sess.run(self.model.cell.zero_state(1, tf.float32)) 45 | word = self.model.word2index[''] 46 | 47 | return state, word 48 | 49 | def create_prime_state(self, sess, prime_words, temperature): 50 | state = sess.run(self.model.cell.zero_state(1, tf.float32)) 51 | 52 | probs = None 53 | for word in prime_words: 54 | id_word = self.model.word2index.get(word, -1) 55 | 56 | if id_word == -1: 57 | continue 58 | 59 | probs, state = self.model.predict(sess, state, id_word, temperature) 60 | 61 | if probs is not None: 62 | while True: 63 | generated_word_id = self.weighted_pick(probs) 64 | 65 | if generated_word_id != 1: 66 | break 67 | else: 68 | return self.create_initial_state(sess) 69 | 70 | return state, generated_word_id 71 | 72 | def generate(self, sess, prime_words=None, temperature=0.7, num_out=200, html=False): 73 | song = [] 74 | current_word = "" 75 | repetition_counter = 0 76 | unk_count = 0 77 | 78 | sequences = [] 79 | sequence = "" 80 | restart = False 81 | 82 | if prime_words: 83 | state, word = self.create_prime_state(sess, prime_words, temperature) 84 | song.extend(prime_words) 85 | else: 86 | state, word = self.create_initial_state(sess) 87 | 88 | for i in range(num_out): 89 | probs, state = self.model.predict(sess, state, word, temperature) 90 | probs = probs[0].reshape(-1) 91 | 92 | while True: 93 | generated_word_id = self.weighted_pick(probs) 94 | generated_word = str(self.model.index2word.get(generated_word_id, 1)) 95 | 96 | if generated_word == '': 97 | unk_count += 0 98 | 99 | if unk_count >= 150: 100 | return -1 101 | 102 | continue 103 | 104 | if generated_word == '' and len(song) < 100: 105 | continue 106 | 107 | if generated_word.lower() != current_word.lower(): 108 | current_word = generated_word 109 | repetition_counter = 0 110 | elif current_word != '': 111 | repetition_counter += 1 112 | 113 | if repetition_counter >= 5: 114 | if repetition_counter >= 100: 115 | return -1 116 | continue 117 | 118 | if generated_word != '': 119 | unk_count = 0 120 | break 121 | 122 | word = generated_word_id 123 | 124 | if generated_word[0].isupper(): 125 | 126 | if sequences.count(sequence) >= 3: 127 | state, word = self.create_initial_state(sess) 128 | restart = True 129 | 130 | sequences.append(sequence) 131 | sequence = generated_word 132 | 133 | if restart: 134 | restart = False 135 | continue 136 | 137 | else: 138 | sequence += generated_word 139 | 140 | if generated_word == '': 141 | break 142 | 143 | song.append(str(generated_word)) 144 | 145 | return self.parse_song(song, html) 146 | -------------------------------------------------------------------------------- /model/song_model.py: -------------------------------------------------------------------------------- 1 | import pickle 2 | import random 3 | import os 4 | 5 | import numpy as np 6 | import tensorflow as tf 7 | 8 | 9 | class ModelConfig: 10 | 11 | def __init__(self, model_params): 12 | self.vocab_size = model_params['vocab_size'] 13 | self.embedding_size = model_params['embedding_size'] 14 | self.learning_rate = model_params['learning_rate'] 15 | self.num_epochs = 
model_params['num_epochs'] 16 | self.use_checkpoint = model_params['use_checkpoint'] 17 | self.index2word_path = model_params['index2word_path'] 18 | self.word2index_path = model_params['word2index_path'] 19 | self.checkpoint_path = model_params['checkpoint_path'] 20 | 21 | 22 | class SongLyricsModel: 23 | 24 | def __init__(self, dataset, config): 25 | self.dataset = dataset 26 | self.config = config 27 | 28 | self.index2word = self.load_dict(self.config.index2word_path) 29 | self.word2index = self.load_dict(self.config.word2index_path) 30 | 31 | def load_dict(self, path): 32 | with open(path, 'rb') as f: 33 | return pickle.load(f) 34 | 35 | def add_embedding_op(self, data_batch): 36 | raise NotImplementedError 37 | 38 | def add_logits_op(self, data_batch, size_batch, reuse=False): 39 | raise NotImplementedError 40 | 41 | def add_loss_op(self, logits, labels_batch, size_batch): 42 | raise NotImplementedError 43 | 44 | def add_train_op(self, loss): 45 | raise NotImplementedError 46 | 47 | def run_epoch(self, sess, ops, feed_dict, training=True): 48 | costs = 0 49 | num_iters = 0 50 | state = None 51 | 52 | while True: 53 | try: 54 | 55 | if random.random() >= 0.01 and state is not None: 56 | feed_dict[self.initial_state] = state 57 | 58 | _, batch_loss, state = sess.run(ops, feed_dict=feed_dict) 59 | 60 | costs += batch_loss 61 | num_iters += 1 62 | 63 | except tf.errors.OutOfRangeError: 64 | return np.exp(costs / num_iters) 65 | 66 | def fit(self, sess, saver): 67 | best_perplexity = 10000000 68 | train_feed_dict = self.create_train_feed_dict() 69 | 70 | for i in range(self.config.num_epochs): 71 | print('Running epoch: {}'.format(i + 1)) 72 | ops = [self.train_op, self.train_loss, self.train_state] 73 | sess.run(self.train_iterator.initializer) 74 | train_perplexity = self.run_epoch(sess, ops, train_feed_dict) 75 | print('Train perplexity: {:.3f}'.format(train_perplexity)) 76 | 77 | if train_perplexity < best_perplexity: 78 | best_perplexity = train_perplexity 79 | print('New best perplexity found ! 
{:.3f}'.format(best_perplexity)) 80 | 81 | saver.save( 82 | sess, os.path.join(self.config.checkpoint_path, 'song_model.ckpt')) 83 | 84 | def predict(self, sess, state, word, temperature): 85 | input_word_id = np.array([[word]]) 86 | feed_dict = self.create_generate_feed_dict( 87 | data=input_word_id, 88 | temperature=temperature, 89 | state=state) 90 | 91 | probs, state = sess.run( 92 | [self.generate_predictions, self.generate_state], feed_dict=feed_dict) 93 | 94 | return probs, state 95 | 96 | def build_placeholders(self): 97 | with tf.name_scope('placeholders'): 98 | self.add_placeholders_op() 99 | 100 | self.data_placeholder = tf.placeholder(tf.int32, [None, 1]) 101 | self.temperature_placeholder = tf.placeholder(tf.float32) 102 | 103 | def build_generate_graph(self, reuse=True): 104 | with tf.name_scope('generate'): 105 | 106 | generate_size = np.array([1]) 107 | generate_logits, self.generate_state = self.add_logits_op( 108 | self.data_placeholder, generate_size, reuse=reuse) 109 | 110 | temperature_logits = tf.div(generate_logits, self.temperature_placeholder) 111 | self.generate_predictions = tf.nn.softmax(temperature_logits) 112 | 113 | def build_graph(self): 114 | self.build_placeholders() 115 | 116 | with tf.name_scope('iterator'): 117 | self.train_iterator = self.dataset.train_iterator 118 | self.validation_iterator = self.dataset.validation_iterator 119 | self.test_iterator = self.dataset.test_iterator 120 | 121 | with tf.name_scope('train_data'): 122 | train_data, train_labels, train_size = self.train_iterator.get_next() 123 | 124 | with tf.name_scope('validation_data'): 125 | (validation_data, validation_labels, 126 | validation_size) = self.validation_iterator.get_next() 127 | 128 | with tf.name_scope('test_data'): 129 | test_data, test_labels, test_size = self.test_iterator.get_next() 130 | 131 | with tf.name_scope('train'): 132 | train_logits, self.train_state = self.add_logits_op(train_data, train_size) 133 | self.train_loss = self.add_loss_op(train_logits, train_labels, train_size) 134 | train_l2_loss = self.add_l2_regularizer_op(self.train_loss) 135 | self.train_op = self.add_train_op(train_l2_loss) 136 | 137 | with tf.name_scope('validation'): 138 | validation_logits, self.validation_state = self.add_logits_op( 139 | validation_data, validation_size, reuse=True) 140 | self.validation_loss = self.add_loss_op( 141 | validation_logits, validation_labels, validation_size) 142 | 143 | self.build_generate_graph() 144 | -------------------------------------------------------------------------------- /preprocessing/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lucasmoura/funk_generator/7b3b978d25ea8731cf37e519a41c59116f1fc0c6/preprocessing/__init__.py -------------------------------------------------------------------------------- /preprocessing/dataset.py: -------------------------------------------------------------------------------- 1 | import os 2 | import random 3 | import pickle 4 | import re 5 | 6 | from pathlib import Path 7 | 8 | 9 | class MusicDataset: 10 | 11 | def __init__(self, data_folder, dataset_save_path): 12 | self.data_folder = Path(data_folder) 13 | self.dataset_save_path = self.data_folder / dataset_save_path 14 | 15 | def parse_song(self, song_text): 16 | parsed_song = [] 17 | 18 | for lines in song_text.split('\n'): 19 | words = lines.split() 20 | 21 | if not words: 22 | continue 23 | 24 | parsed_words = [words[0]] 25 | 26 | for word in words[1:]: 27 | 
parsed_words.append(word.lower()) 28 | 29 | parsed_song.append(' '.join(parsed_words)) 30 | 31 | return '\n'.join(parsed_song) 32 | 33 | def read_song(self, song_path): 34 | with song_path.open() as song_file: 35 | return self.parse_song(song_file.read()) 36 | 37 | def get_num_words_from_song(self, song): 38 | return len(song.replace('\n', ' ').split(' ')) 39 | 40 | def format_song_text(self, song_text): 41 | song_text = re.sub(r",", "", song_text) 42 | song_text = re.sub(r"!", " ! ", song_text) 43 | song_text = re.sub(r"\?", " ? ", song_text) 44 | song_text = re.sub(r"\)", "", song_text) 45 | song_text = re.sub(r"\(", "", song_text) 46 | song_text = re.sub(r"\}", "", song_text) 47 | song_text = re.sub(r"\{", "", song_text) 48 | song_text = re.sub(r":", "", song_text) 49 | song_text = re.sub(r"\.", " ", song_text) 50 | song_text = re.sub(r"\n", " \n ", song_text) 51 | song_text = re.sub(r'"', " ", song_text) 52 | song_text = re.sub(r"\[.*\]", " ", song_text) 53 | 54 | song_text = ' ' + song_text + ' ' 55 | song_text = re.sub(r'\s{2,}', ' ', song_text) 56 | 57 | song_text = song_text.split(' ') 58 | 59 | return song_text 60 | 61 | def get_songs(self): 62 | self.all_songs = [] 63 | 64 | for artist in os.listdir(str(self.data_folder)): 65 | artist_path = self.data_folder / artist 66 | 67 | if not artist_path.is_dir(): 68 | continue 69 | 70 | for song in os.listdir(str(artist_path)): 71 | if song == 'song_codes.txt' or song == 'song_names.txt': 72 | continue 73 | 74 | song_path = artist_path / song 75 | song_text = self.read_song(song_path) 76 | 77 | song_text = self.format_song_text(song_text) 78 | self.all_songs.append(song_text) 79 | 80 | def average_song_size(self): 81 | avg_size = 0 82 | for song in self.all_songs: 83 | avg_size += len(song) 84 | 85 | return avg_size / len(self.all_songs) 86 | 87 | def load_dataset(self, dataset_type): 88 | dataset_path = self.dataset_save_path / dataset_type 89 | 90 | with dataset_path.open(mode='rb') as dataset_file: 91 | return pickle.load(dataset_file) 92 | 93 | def load_datasets(self): 94 | if not self.dataset_save_path.is_dir(): 95 | return False 96 | 97 | print('Loading datasets ...') 98 | self.all_songs = self.load_dataset('all_songs.pkl') 99 | self.train_dataset = self.load_dataset('train/raw_train.pkl') 100 | self.validation_dataset = self.load_dataset('validation/raw_validation.pkl') 101 | self.test_dataset = self.load_dataset('test/raw_test.pkl') 102 | 103 | return True 104 | 105 | def save_dataset(self, dataset, dataset_path): 106 | with dataset_path.open(mode='wb') as dataset_file: 107 | pickle.dump(dataset, dataset_file) 108 | 109 | def create_dirs(self): 110 | if not self.dataset_save_path.is_dir(): 111 | self.dataset_save_path.mkdir() 112 | 113 | train_dataset = self.dataset_save_path / 'train' 114 | if not train_dataset.is_dir(): 115 | train_dataset.mkdir() 116 | 117 | validation_dataset = self.dataset_save_path / 'validation' 118 | if not validation_dataset.is_dir(): 119 | validation_dataset.mkdir() 120 | 121 | test_dataset = self.dataset_save_path / 'test' 122 | if not test_dataset.is_dir(): 123 | test_dataset.mkdir() 124 | 125 | return train_dataset, validation_dataset, test_dataset 126 | 127 | def save_datasets(self): 128 | 129 | (train_dataset_path, validation_dataset_path, 130 | test_dataset_path) = self.create_dirs() 131 | 132 | save_path = self.dataset_save_path / 'all_songs.pkl' 133 | self.save_dataset(self.all_songs, save_path) 134 | 135 | train_dataset_path = train_dataset_path / 'raw_train.pkl' 136 | 
self.save_dataset(self.train_dataset, train_dataset_path) 137 | 138 | validation_dataset_path = validation_dataset_path / 'raw_validation.pkl' 139 | self.save_dataset(self.validation_dataset, validation_dataset_path) 140 | 141 | test_dataset_path = test_dataset_path / 'raw_test.pkl' 142 | self.save_dataset(self.test_dataset, test_dataset_path) 143 | 144 | def split_dataset(self, validation_percent, test_percent): 145 | total_size = validation_percent + test_percent 146 | 147 | random.shuffle(self.all_songs) 148 | train_size = len(self.all_songs) - int(len(self.all_songs) * total_size) 149 | self.train_dataset = self.all_songs[:train_size] 150 | 151 | validation_size = train_size + int(len(self.all_songs) * validation_percent) 152 | self.validation_dataset = self.all_songs[train_size: validation_size] 153 | 154 | self.test_dataset = self.all_songs[validation_size:] 155 | 156 | def create_dataset(self, validation_percent, test_percent): 157 | if not self.load_datasets(): 158 | print('Creating dataset ...') 159 | self.get_songs() 160 | self.split_dataset(validation_percent, test_percent) 161 | self.save_datasets() 162 | 163 | def display_info(self): 164 | print('Total number of songs: {}'.format(len(self.all_songs))) 165 | print('Average size of songs: {} words'.format(int(self.average_song_size()))) 166 | 167 | print('Train size: {}'.format(len(self.train_dataset))) 168 | print('Validation size: {}'.format(len(self.validation_dataset))) 169 | print('Test size: {}'.format(len(self.test_dataset))) 170 | -------------------------------------------------------------------------------- /preprocessing/text_preprocessing.py: -------------------------------------------------------------------------------- 1 | import pickle 2 | 3 | from tensorflow.contrib import learn 4 | 5 | 6 | def get_vocabulary(text_array, min_frequency): 7 | def tokenizer_fn(iterator): 8 | return (x.split(' ') for x in iterator) 9 | 10 | max_size = max([len(review) for review in text_array]) 11 | text_array = [' '.join(text) for text in text_array] 12 | 13 | vocabulary_processor = learn.preprocessing.VocabularyProcessor( 14 | max_size, tokenizer_fn=tokenizer_fn, min_frequency=min_frequency) 15 | 16 | vocabulary_processor.fit(text_array) 17 | 18 | vocab = vocabulary_processor.vocabulary_._mapping 19 | sorted_vocab = sorted(vocab.items(), key=lambda x: x[1]) 20 | 21 | return sorted_vocab 22 | 23 | 24 | def create_word_dictionaties(vocab): 25 | word2index = {word: index + 1 for (word, index) in vocab} 26 | index2word = {index: word for word, index in word2index.items()} 27 | 28 | return word2index, index2word 29 | 30 | 31 | def save(data, data_path): 32 | with data_path.open('wb') as data_file: 33 | pickle.dump(data, data_file) 34 | 35 | 36 | def replace_unk_words(dataset, word2index): 37 | for text in dataset: 38 | for index_word, word in enumerate(text[:]): 39 | if word not in word2index: 40 | text[index_word] = '' 41 | 42 | 43 | def replace_words_with_ids(dataset, word2index): 44 | word_id_dataset = [] 45 | 46 | for data in dataset: 47 | word_id = [word2index[word] for word in data] 48 | word_id_dataset.append(word_id) 49 | 50 | return word_id_dataset 51 | 52 | 53 | def get_sizes_list(dataset): 54 | return [len(data) for data in dataset] 55 | 56 | 57 | def create_chunks(dataset, chunk_max_size): 58 | chunks_dataset = [] 59 | 60 | for song in dataset: 61 | chunks = [song[x:x + chunk_max_size] 62 | for x in range(0, len(song), chunk_max_size)] 63 | chunks_dataset.extend(chunks) 64 | 65 | return chunks_dataset 66 | 67 | 68 | def 
create_labels(dataset): 69 | new_dataset = [] 70 | labels = [] 71 | 72 | for data in dataset: 73 | new_dataset.append(data[0:-1]) 74 | labels.append(data[1:]) 75 | 76 | return new_dataset, labels 77 | -------------------------------------------------------------------------------- /preprocessing/tfrecord.py: -------------------------------------------------------------------------------- 1 | import tensorflow as tf 2 | 3 | 4 | class SentenceTFRecord(): 5 | def __init__(self, dataset, output_path): 6 | self.dataset = dataset 7 | self.output_path = output_path 8 | 9 | def parse_sentences(self): 10 | writer = tf.python_io.TFRecordWriter(self.output_path) 11 | 12 | all_data, all_labels, all_sizes = self.dataset 13 | 14 | for data, labels, size in zip(all_data, all_labels, all_sizes): 15 | example = self.make_example(data, labels, size) 16 | writer.write(example.SerializeToString()) 17 | 18 | writer.close() 19 | 20 | def make_example(self, data, labels, size): 21 | example = tf.train.SequenceExample() 22 | 23 | example.context.feature['size'].int64_list.value.append(size) 24 | 25 | sentence_tokens = example.feature_lists.feature_list['tokens'] 26 | labels_tokens = example.feature_lists.feature_list['labels'] 27 | 28 | for (token, label) in zip(data, labels): 29 | sentence_tokens.feature.add().int64_list.value.append(int(token)) 30 | labels_tokens.feature.add().int64_list.value.append(int(label)) 31 | 32 | return example 33 | -------------------------------------------------------------------------------- /requirements.txt: -------------------------------------------------------------------------------- 1 | tensorflow-gpu==1.4.1 2 | beautifulsoup4 3 | requests 4 | flask 5 | flask-cors 6 | nltk 7 | wordcloud 8 | flake8 9 | -------------------------------------------------------------------------------- /sample.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import os 3 | 4 | from collections import defaultdict 5 | 6 | from model.rnn import RecurrentModel, RecurrentConfig 7 | from model.song_generator import GreedySongGenerator 8 | from utils.session_manager import initialize_session 9 | 10 | os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3' 11 | 12 | 13 | def create_argparse(): 14 | argument_parser = argparse.ArgumentParser() 15 | 16 | argument_parser.add_argument('-chp', 17 | '--checkpoint-path', 18 | type=str, 19 | help="The path to save model's checkpoint") 20 | 21 | argument_parser.add_argument('-i2w', 22 | '--index2word-path', 23 | type=str, 24 | help='Location of the index2word dict') 25 | 26 | argument_parser.add_argument('-w2i', 27 | '--word2index-path', 28 | type=str, 29 | help='Location of word2index dict') 30 | 31 | argument_parser.add_argument('-vs', 32 | '--vocab-size', 33 | type=int, 34 | help='Size of the vocabulary') 35 | 36 | argument_parser.add_argument('-es', 37 | '--embedding-size', 38 | type=int, 39 | help='Dimension of the embedding matrix') 40 | 41 | argument_parser.add_argument('-nl', 42 | '--num-layers', 43 | type=int, 44 | help='Number of lstm layers to use') 45 | 46 | argument_parser.add_argument('-nu', 47 | '--num-units', 48 | type=int, 49 | help='Number of units to use in the lstm cell') 50 | 51 | argument_parser.add_argument('-t', 52 | '--temperature', 53 | type=float, 54 | help='Logits temperature') 55 | 56 | return argument_parser 57 | 58 | 59 | def main(): 60 | argument_parser = create_argparse() 61 | user_args = vars(argument_parser.parse_args()) 62 | user_args['use_checkpoint'] = True 63 | prime_words = [] 64
| 65 | user_args = defaultdict(int, user_args) 66 | 67 | config = RecurrentConfig(user_args) 68 | model = RecurrentModel(None, config) 69 | model.build_placeholders() 70 | model.build_generate_graph(reuse=False) 71 | 72 | with initialize_session(config, use_gpu=False) as (sess, saver): 73 | generator = GreedySongGenerator(model) 74 | temperature = user_args['temperature'] 75 | 76 | print('Generating song (Greedy) ...') 77 | print(generator.generate(sess, prime_words=prime_words, temperature=temperature)) 78 | 79 | 80 | if __name__ == '__main__': 81 | main() 82 | -------------------------------------------------------------------------------- /scripts/create_sample.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | set -e 4 | 5 | #usage: ./scripts/create_sample.sh 6 | 7 | ALL_INDEX2WORD_PATH='data/song_dataset/index2word.pkl' 8 | ALL_WORD2INDEX_PATH='data/song_dataset/word2index.pkl' 9 | ALL_CHECKPOINT_PATH='good_checkpoint' 10 | ALL_VOCAB_SIZE=12399 11 | 12 | KONDZILLA_INDEX2WORD_PATH='kondzilla/song_dataset/index2word.pkl' 13 | KONDZILLA_WORD2INDEX_PATH='kondzilla/song_dataset/word2index.pkl' 14 | KONDZILLA_CHECKPOINT_PATH='kondzilla_checkpoint' 15 | KONDZILLA_VOCAB_SIZE=2192 16 | 17 | PROIBIDAO_INDEX2WORD_PATH='proibidao-songs/song_dataset/index2word.pkl' 18 | PROIBIDAO_WORD2INDEX_PATH='proibidao-songs/song_dataset/word2index.pkl' 19 | PROIBIDAO_CHECKPOINT_PATH='proibidao_checkpoint' 20 | PROIBIDAO_VOCAB_SIZE=1445 21 | 22 | OSTENTACAO_INDEX2WORD_PATH='ostentacao-songs/song_dataset/index2word.pkl' 23 | OSTENTACAO_WORD2INDEX_PATH='ostentacao-songs/song_dataset/word2index.pkl' 24 | OSTENTACAO_CHECKPOINT_PATH='ostentacao_checkpoint' 25 | OSTENTACAO_VOCAB_SIZE=2035 26 | 27 | EMBEDDING_SIZE=300 28 | NUM_LAYERS=3 29 | NUM_UNITS=728 30 | TEMPERATURE=0.7 31 | 32 | PARAM=${1:-all} 33 | if [ $PARAM == "all" ]; then 34 | echo "Creating sample for all songs model" 35 | INDEX2WORD_PATH=$ALL_INDEX2WORD_PATH 36 | WORD2INDEX_PATH=$ALL_WORD2INDEX_PATH 37 | CHECKPOINT_PATH=$ALL_CHECKPOINT_PATH 38 | VOCAB_SIZE=$ALL_VOCAB_SIZE 39 | elif [ $PARAM == "kondzilla" ]; then 40 | echo "Creating sample for kondzilla songs model" 41 | INDEX2WORD_PATH=$KONDZILLA_INDEX2WORD_PATH 42 | WORD2INDEX_PATH=$KONDZILLA_WORD2INDEX_PATH 43 | CHECKPOINT_PATH=$KONDZILLA_CHECKPOINT_PATH 44 | VOCAB_SIZE=$KONDZILLA_VOCAB_SIZE 45 | elif [ $PARAM == "proibidao" ]; then 46 | echo "Creating sample for probidao songs model" 47 | INDEX2WORD_PATH=$PROIBIDAO_INDEX2WORD_PATH 48 | WORD2INDEX_PATH=$PROIBIDAO_WORD2INDEX_PATH 49 | CHECKPOINT_PATH=$PROIBIDAO_CHECKPOINT_PATH 50 | VOCAB_SIZE=$PROIBIDAO_VOCAB_SIZE 51 | elif [ $PARAM == "ostentacao" ]; then 52 | echo "Creating sample for ostentacao songs model" 53 | INDEX2WORD_PATH=$OSTENTACAO_INDEX2WORD_PATH 54 | WORD2INDEX_PATH=$OSTENTACAO_WORD2INDEX_PATH 55 | CHECKPOINT_PATH=$OSTENTACAO_CHECKPOINT_PATH 56 | VOCAB_SIZE=$OSTENTACAO_VOCAB_SIZE 57 | fi 58 | 59 | python -u sample.py \ 60 | --checkpoint-path=${CHECKPOINT_PATH} \ 61 | --index2word-path=${INDEX2WORD_PATH} \ 62 | --word2index-path=${WORD2INDEX_PATH} \ 63 | --vocab-size=${VOCAB_SIZE} \ 64 | --embedding-size=${EMBEDDING_SIZE} \ 65 | --num-layers=${NUM_LAYERS} \ 66 | --num-units=${NUM_UNITS} \ 67 | --temperature=${TEMPERATURE} 68 | -------------------------------------------------------------------------------- /scripts/create_wordcloud_graph.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | set -e 4 | 5 | #usage: 
./scripts/create_word_cloud.sh 6 | 7 | ALL_SONGS_PATH='data/song_dataset/all_songs.pkl' 8 | ALL_GRAPH_NAME='all-word-cloud-graph.png' 9 | 10 | KONDZILLA_SONGS_PATH='kondzilla/song_dataset/all_songs.pkl' 11 | KONDZILLA_GRAPH_NAME='kondzilla-word-cloud-graph.png' 12 | 13 | PROIBIDAO_SONGS_PATH='proibidao-songs/song_dataset/all_songs.pkl' 14 | PROIBIDAO_GRAPH_NAME='proibidao-word-cloud-graph.png' 15 | 16 | OSTENTACAO_SONGS_PATH='ostentacao-songs/song_dataset/all_songs.pkl' 17 | OSTENTACAO_GRAPH_NAME='ostentacao-word-cloud-graph.png' 18 | 19 | 20 | PARAM=${1:-all} 21 | if [ $PARAM == "all" ]; then 22 | echo "Creating word cloud graph for all songs" 23 | SONGS_PATH=$ALL_SONGS_PATH 24 | GRAPH_NAME=$ALL_GRAPH_NAME 25 | elif [ $PARAM == "kondzilla" ]; then 26 | echo "Creating word cloud graph for kondzilla songs" 27 | SONGS_PATH=$KONDZILLA_SONGS_PATH 28 | GRAPH_NAME=$KONDZILLA_GRAPH_NAME 29 | elif [ $PARAM == "proibidao" ]; then 30 | echo "Creating word cloud graph for proibidao songs" 31 | SONGS_PATH=$PROIBIDAO_SONGS_PATH 32 | GRAPH_NAME=$PROIBIDAO_GRAPH_NAME 33 | elif [ $PARAM == "ostentacao" ]; then 34 | echo "Creating word cloud graph for ostentacao songs" 35 | SONGS_PATH=$OSTENTACAO_SONGS_PATH 36 | GRAPH_NAME=$OSTENTACAO_GRAPH_NAME 37 | fi 38 | 39 | python -u song_word_cloud.py \ 40 | --songs-path=${SONGS_PATH} \ 41 | --graph-name=${GRAPH_NAME} 42 | -------------------------------------------------------------------------------- /scripts/download_musics.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | #usage: ./scripts/download_musics.sh 4 | 5 | set -e 6 | 7 | DATA_FOLDER='data/' 8 | KEY_FILE_PATH=$DATA_FOLDER'config' 9 | CODE_FILES_NAME='song_codes.txt' 10 | 11 | python vagalume_downloader.py \ 12 | --key-file-path=${KEY_FILE_PATH} \ 13 | --data-folder=${DATA_FOLDER} \ 14 | --code-files-name=${CODE_FILES_NAME} 15 | -------------------------------------------------------------------------------- /scripts/run_format_data.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | #usage: ./scripts/run_format_data.sh 4 | 5 | set -e 6 | 7 | DATA_FOLDER='data' 8 | DATASET_SAVE_PATH='song_dataset' 9 | MIN_FREQUENCY=5 10 | VALIDATION_PERCENT=0.0 11 | TEST_PERCENT=0.0 12 | 13 | 14 | python format_data.py \ 15 | --data-folder=${DATA_FOLDER} \ 16 | --dataset-save-path=${DATASET_SAVE_PATH} \ 17 | --min-frequency=${MIN_FREQUENCY} \ 18 | --validation-percent=${VALIDATION_PERCENT} \ 19 | --test-percent=${TEST_PERCENT} 20 | -------------------------------------------------------------------------------- /scripts/run_model.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | set -e 4 | 5 | #usage: ./scripts/run_model.sh 6 | 7 | ALL_TRAIN_FILE='data/song_dataset/train/train.tfrecord' 8 | ALL_VALIDATION_FILE='data/song_dataset/validation/validation.tfrecord' 9 | ALL_TEST_FILE='data/song_dataset/test/test.tfrecord' 10 | ALL_INDEX2WORD_PATH='data/song_dataset/index2word.pkl' 11 | ALL_WORD2INDEX_PATH='data/song_dataset/word2index.pkl' 12 | ALL_CHECKPOINT_PATH='checkpoint' 13 | ALL_VOCAB_SIZE=12551 14 | 15 | KONDZILLA_TRAIN_FILE='kondzilla/song_dataset/train/train.tfrecord' 16 | KONDZILLA_VALIDATION_FILE='kondzilla/song_dataset/validation/validation.tfrecord' 17 | KONDZILLA_TEST_FILE='kondzilla/song_dataset/test/test.tfrecord' 18 | KONDZILLA_INDEX2WORD_PATH='kondzilla/song_dataset/index2word.pkl' 19 | 
KONDZILLA_WORD2INDEX_PATH='kondzilla/song_dataset/word2index.pkl' 20 | KONDZILLA_CHECKPOINT_PATH='kondzilla_checkpoint' 21 | KONDZILLA_VOCAB_SIZE=2192 22 | 23 | PROIBIDAO_TRAIN_FILE='proibidao-songs/song_dataset/train/train.tfrecord' 24 | PROIBIDAO_VALIDATION_FILE='proibidao-songs/song_dataset/validation/validation.tfrecord' 25 | PROIBIDAO_TEST_FILE='proibidao-songs/song_dataset/test/test.tfrecord' 26 | PROIBIDAO_INDEX2WORD_PATH='proibidao-songs/song_dataset/index2word.pkl' 27 | PROIBIDAO_WORD2INDEX_PATH='proibidao-songs/song_dataset/word2index.pkl' 28 | PROIBIDAO_CHECKPOINT_PATH='proibidao_checkpoint' 29 | PROIBIDAO_VOCAB_SIZE=1445 30 | 31 | OSTENTACAO_TRAIN_FILE='ostentacao-songs/song_dataset/train/train.tfrecord' 32 | OSTENTACAO_VALIDATION_FILE='ostentacao-songs/song_dataset/validation/validation.tfrecord' 33 | OSTENTACAO_TEST_FILE='ostentacao-songs/song_dataset/test/test.tfrecord' 34 | OSTENTACAO_INDEX2WORD_PATH='ostentacao-songs/song_dataset/index2word.pkl' 35 | OSTENTACAO_WORD2INDEX_PATH='ostentacao-songs/song_dataset/word2index.pkl' 36 | OSTENTACAO_CHECKPOINT_PATH='ostentacao_checkpoint' 37 | OSTENTACAO_VOCAB_SIZE=2035 38 | 39 | USE_CHECKPOINT=1 40 | 41 | LEARNING_RATE=0.002 42 | NUM_EPOCHS=20 43 | BATCH_SIZE=32 44 | 45 | NUM_LAYERS=3 46 | NUM_UNITS=728 47 | EMBEDDING_SIZE=300 48 | MIN_VAL=-1 49 | MAX_VAL=1 50 | 51 | EMBEDDING_DROPOUT=0.5 52 | LSTM_OUTPUT_DROPOUT=0.5 53 | LSTM_STATE_DROPOUT=0.5 54 | LSTM_INPUT_DROPOUT=0.5 55 | WEIGHT_DECAY=0.0000 56 | 57 | NUM_BUCKETS=30 58 | BUCKET_WIDTH=30 59 | PREFETCH_BUFFER=8 60 | PERFORM_SHUFFLE=1 61 | 62 | PARAM=${1:-all} 63 | if [ $PARAM == "all" ]; then 64 | echo "Running model for all songs" 65 | TRAIN_FILE=$ALL_TRAIN_FILE 66 | VALIDATION_FILE=$ALL_VALIDATION_FILE 67 | TEST_FILE=$ALL_TEST_FILE 68 | INDEX2WORD_PATH=$ALL_INDEX2WORD_PATH 69 | WORD2INDEX_PATH=$ALL_WORD2INDEX_PATH 70 | CHECKPOINT_PATH=$ALL_CHECKPOINT_PATH 71 | VOCAB_SIZE=$ALL_VOCAB_SIZE 72 | elif [ $PARAM == "kondzilla" ]; then 73 | echo "Running model for kondzilla songs" 74 | TRAIN_FILE=$KONDZILLA_TRAIN_FILE 75 | VALIDATION_FILE=$KONDZILLA_VALIDATION_FILE 76 | TEST_FILE=$KONDZILLA_TEST_FILE 77 | INDEX2WORD_PATH=$KONDZILLA_INDEX2WORD_PATH 78 | WORD2INDEX_PATH=$KONDZILLA_WORD2INDEX_PATH 79 | CHECKPOINT_PATH=$KONDZILLA_CHECKPOINT_PATH 80 | VOCAB_SIZE=$KONDZILLA_VOCAB_SIZE 81 | elif [ $PARAM == "proibidao" ]; then 82 | echo "Running model for probidao songs" 83 | TRAIN_FILE=$PROIBIDAO_TRAIN_FILE 84 | VALIDATION_FILE=$PROIBIDAO_VALIDATION_FILE 85 | TEST_FILE=$PROIBIDAO_TEST_FILE 86 | INDEX2WORD_PATH=$PROIBIDAO_INDEX2WORD_PATH 87 | WORD2INDEX_PATH=$PROIBIDAO_WORD2INDEX_PATH 88 | CHECKPOINT_PATH=$PROIBIDAO_CHECKPOINT_PATH 89 | VOCAB_SIZE=$PROIBIDAO_VOCAB_SIZE 90 | elif [ $PARAM == "ostentacao" ]; then 91 | echo "Running model for ostentacao songs" 92 | TRAIN_FILE=$OSTENTACAO_TRAIN_FILE 93 | VALIDATION_FILE=$OSTENTACAO_VALIDATION_FILE 94 | TEST_FILE=$OSTENTACAO_TEST_FILE 95 | INDEX2WORD_PATH=$OSTENTACAO_INDEX2WORD_PATH 96 | WORD2INDEX_PATH=$OSTENTACAO_WORD2INDEX_PATH 97 | CHECKPOINT_PATH=$OSTENTACAO_CHECKPOINT_PATH 98 | VOCAB_SIZE=$OSTENTACAO_VOCAB_SIZE 99 | fi 100 | 101 | 102 | python -u model.py \ 103 | --train-file=${TRAIN_FILE} \ 104 | --validation-file=${VALIDATION_FILE} \ 105 | --test-file=${TEST_FILE} \ 106 | --checkpoint-path=${CHECKPOINT_PATH} \ 107 | --use-checkpoint=${USE_CHECKPOINT} \ 108 | --index2word-path=${INDEX2WORD_PATH} \ 109 | --word2index-path=${WORD2INDEX_PATH} \ 110 | --num-epochs=${NUM_EPOCHS} \ 111 | --batch-size=${BATCH_SIZE} \ 112 | 
--learning-rate=${LEARNING_RATE} \ 113 | --num-layers=${NUM_LAYERS} \ 114 | --num-units=${NUM_UNITS} \ 115 | --vocab-size=${VOCAB_SIZE} \ 116 | --embedding-size=${EMBEDDING_SIZE} \ 117 | --embedding-dropout=${EMBEDDING_DROPOUT} \ 118 | --lstm-output-dropout=${LSTM_OUTPUT_DROPOUT} \ 119 | --lstm-input-dropout=${LSTM_INPUT_DROPOUT} \ 120 | --lstm-state-dropout=${LSTM_STATE_DROPOUT} \ 121 | --weight-decay=${WEIGHT_DECAY} \ 122 | --min-val=${MIN_VAL} \ 123 | --max-val=${MAX_VAL} \ 124 | --num-buckets=${NUM_BUCKETS} \ 125 | --bucket-width=${BUCKET_WIDTH} \ 126 | --prefetch-buffer=${PREFETCH_BUFFER} \ 127 | --perform-shuffle=${PERFORM_SHUFFLE} 128 | -------------------------------------------------------------------------------- /scripts/run_vagalume_crawler.sh: -------------------------------------------------------------------------------- 1 | #!/bin/bash 2 | 3 | #usage: ./scripts/run_vagalume_crawler.sh 4 | 5 | set -e 6 | 7 | DATA_FOLDER='data/' 8 | ARTIST_LIST_PATH=$DATA_FOLDER'artist_list.txt' 9 | 10 | 11 | python vagalume_crawler.py \ 12 | --data-folder=${DATA_FOLDER} \ 13 | --artist-list-path=${ARTIST_LIST_PATH} 14 | -------------------------------------------------------------------------------- /song_word_cloud.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | import nltk 3 | import pickle 4 | 5 | from PIL import Image 6 | import numpy as np 7 | 8 | from wordcloud import WordCloud 9 | 10 | 11 | def create_argparse(): 12 | argument_parser = argparse.ArgumentParser() 13 | 14 | argument_parser.add_argument('-sp', 15 | '--songs-path', 16 | type=str, 17 | help='Location of the all songs pickle file') 18 | 19 | argument_parser.add_argument('-gn', 20 | '--graph-name', 21 | type=str, 22 | help='Name of the word cloud graph') 23 | 24 | return argument_parser 25 | 26 | 27 | def create_songs_str(songs_path): 28 | stopwords = nltk.corpus.stopwords.words('portuguese') 29 | more_words = ['pra', 'tá', 'pode', 'tô', 'hoje', 'Então', 'então', 'agora', 30 | 'tudo', 'porque', 'sempre', 'quero', 'quer', 'sei', 'Refrão', 31 | '2x', 'assim', 'aqui', 'todo', 'vai', 'vem', 'nóis', 'vou', 32 | 'pro', 'ser', 'nois', 'ter', 'tao', 'la', 'tão', 'ta'] 33 | stopwords.extend(more_words) 34 | 35 | with open(songs_path, 'rb') as f: 36 | all_songs = pickle.load(f) 37 | 38 | songs = [' '.join(s[1:-1]) for s in all_songs] 39 | all_songs_str = '\n'.join(songs) 40 | stopwords = set(stopwords) 41 | 42 | return all_songs_str, stopwords 43 | 44 | 45 | def create_word_cloud_graph(all_songs_str, stopwords, graph_name): 46 | sarrada_mask = np.array(Image.open("masks/romano.png")) 47 | wc = WordCloud(background_color="white", max_words=2000, mask=sarrada_mask, 48 | stopwords=stopwords) 49 | wc.generate(all_songs_str) 50 | wc.to_file(graph_name) 51 | 52 | 53 | def main(): 54 | argument_parser = create_argparse() 55 | user_args = vars(argument_parser.parse_args()) 56 | songs_path = user_args['songs_path'] 57 | graph_name = user_args['graph_name'] 58 | 59 | all_songs_str, stopwords = create_songs_str(songs_path) 60 | create_word_cloud_graph(all_songs_str, stopwords, graph_name) 61 | 62 | 63 | if __name__ == '__main__': 64 | main() 65 | -------------------------------------------------------------------------------- /utils/__init__.py: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/lucasmoura/funk_generator/7b3b978d25ea8731cf37e519a41c59116f1fc0c6/utils/__init__.py 
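The word-cloud helpers in song_word_cloud.py above can also be called directly from Python instead of through scripts/create_wordcloud_graph.sh. The sketch below is only an illustration: the pickle path and output name mirror the defaults from that shell script, and it assumes the NLTK Portuguese stopwords corpus and masks/romano.png are available.

```python
# Sketch only: reuse song_word_cloud.py's helpers directly (paths mirror the shell script).
from song_word_cloud import create_songs_str, create_word_cloud_graph

# Requires the NLTK stopwords corpus, e.g. nltk.download('stopwords') beforehand.
all_songs_str, stopwords = create_songs_str('data/song_dataset/all_songs.pkl')
create_word_cloud_graph(all_songs_str, stopwords, 'all-word-cloud-graph.png')
```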
-------------------------------------------------------------------------------- /utils/progress_bar.py: -------------------------------------------------------------------------------- 1 | import time 2 | import sys 3 | 4 | import numpy as np 5 | 6 | 7 | class Progbar(): 8 | """ 9 | Progbar class copied from keras (https://github.com/fchollet/keras/) 10 | Displays a progress bar. 11 | # Arguments 12 | target: Total number of steps expected. 13 | interval: Minimum visual progress update interval (in seconds). 14 | """ 15 | 16 | def __init__(self, target, width=30, verbose=1): 17 | self.width = width 18 | self.target = target 19 | self.sum_values = {} 20 | self.unique_values = [] 21 | self.start = time.time() 22 | self.total_width = 0 23 | self.seen_so_far = 0 24 | self.verbose = verbose 25 | 26 | def update(self, current, values=None, exact=None): 27 | """ 28 | Updates the progress bar. 29 | # Arguments 30 | current: Index of current step. 31 | values: List of tuples (name, value_for_last_step). 32 | The progress bar will display averages for these values. 33 | exact: List of tuples (name, value_for_last_step). 34 | The progress bar will display these values directly. 35 | """ 36 | values = values or [] 37 | exact = exact or [] 38 | 39 | for k, v in values: 40 | if k not in self.sum_values: 41 | self.sum_values[k] = [v * (current - self.seen_so_far), current - self.seen_so_far] 42 | self.unique_values.append(k) 43 | else: 44 | self.sum_values[k][0] += v * (current - self.seen_so_far) 45 | self.sum_values[k][1] += (current - self.seen_so_far) 46 | for k, v in exact: 47 | if k not in self.sum_values: 48 | self.unique_values.append(k) 49 | self.sum_values[k] = [v, 1] 50 | self.seen_so_far = current 51 | 52 | now = time.time() 53 | if self.verbose == 1: 54 | prev_total_width = self.total_width 55 | sys.stdout.write("\b" * prev_total_width) 56 | sys.stdout.write("\r") 57 | 58 | numdigits = int(np.floor(np.log10(self.target))) + 1 59 | barstr = '%%%dd/%%%dd [' % (numdigits, numdigits) 60 | bar = barstr % (current, self.target) 61 | prog = float(current)/self.target 62 | prog_width = int(self.width*prog) 63 | if prog_width > 0: 64 | bar += ('='*(prog_width-1)) 65 | if current < self.target: 66 | bar += '>' 67 | else: 68 | bar += '=' 69 | bar += ('.'*(self.width-prog_width)) 70 | bar += ']' 71 | sys.stdout.write(bar) 72 | self.total_width = len(bar) 73 | 74 | if current: 75 | time_per_unit = (now - self.start) / current 76 | else: 77 | time_per_unit = 0 78 | eta = time_per_unit*(self.target - current) 79 | info = '' 80 | if current < self.target: 81 | info += ' - ETA: %ds' % eta 82 | else: 83 | info += ' - %ds' % (now - self.start) 84 | for k in self.unique_values: 85 | if isinstance(self.sum_values[k], list): 86 | info += ' - %s: %.4f' % ( 87 | k, self.sum_values[k][0] / max(1, self.sum_values[k][1])) 88 | else: 89 | info += ' - %s: %s' % (k, self.sum_values[k]) 90 | 91 | self.total_width += len(info) 92 | if prev_total_width > self.total_width: 93 | info += ((prev_total_width-self.total_width) * " ") 94 | 95 | sys.stdout.write(info) 96 | sys.stdout.flush() 97 | 98 | if current >= self.target: 99 | sys.stdout.write("\n") 100 | 101 | if self.verbose == 2: 102 | if current >= self.target: 103 | info = '%ds' % (now - self.start) 104 | for k in self.unique_values: 105 | info += ' - %s: %.4f' % ( 106 | k, self.sum_values[k][0] / max(1, self.sum_values[k][1])) 107 | sys.stdout.write(info + "\n") 108 | 109 | def add(self, n, values=None): 110 | self.update(self.seen_so_far+n, values) 111 | 
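To make the intended use of the Progbar class above concrete, here is a small, hypothetical driver loop. The step count and the reported 'loss' value are placeholders chosen for the example, not values taken from the project.

```python
# Sketch only: driving utils/progress_bar.Progbar from a training-style loop.
from utils.progress_bar import Progbar

num_batches = 50
progbar = Progbar(target=num_batches)

for step in range(1, num_batches + 1):
    batch_loss = 1.0 / step  # placeholder metric for the example
    # Tuples passed via `values` are averaged over steps; use `exact` to display raw values.
    progbar.update(step, values=[('loss', batch_loss)])
```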
-------------------------------------------------------------------------------- /utils/session_manager.py: -------------------------------------------------------------------------------- 1 | import contextlib 2 | import os 3 | 4 | import tensorflow as tf 5 | 6 | 7 | @contextlib.contextmanager 8 | def initialize_session(user_config, use_gpu=True): 9 | if use_gpu: 10 | config = tf.ConfigProto() 11 | config.gpu_options.allow_growth = True 12 | else: 13 | config = tf.ConfigProto(device_count={'GPU': 0}) 14 | 15 | checkpoint = tf.train.latest_checkpoint(user_config.checkpoint_path) 16 | saver = tf.train.Saver() 17 | 18 | with tf.Session(config=config) as sess: 19 | if user_config.use_checkpoint: 20 | print('Load checkpoint: {}'.format(checkpoint)) 21 | saver.restore(sess, checkpoint) 22 | else: 23 | print('Creating new model') 24 | if not os.path.exists(user_config.checkpoint_path): 25 | os.makedirs(user_config.checkpoint_path) 26 | 27 | sess.run(tf.global_variables_initializer()) 28 | 29 | yield (sess, saver) 30 | -------------------------------------------------------------------------------- /vagalume_crawler.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | 3 | from crawler.music_crawler import MusicCrawler 4 | 5 | 6 | def create_argparse(): 7 | parser = argparse.ArgumentParser() 8 | 9 | parser.add_argument('-d', 10 | '--data-folder', 11 | type=str, 12 | help='Data folder path') 13 | 14 | parser.add_argument('-al', 15 | '--artist-list-path', 16 | type=str, 17 | help='Path of the file containing the artists') 18 | 19 | return parser 20 | 21 | 22 | def main(): 23 | parser = create_argparse() 24 | user_args = vars(parser.parse_args()) 25 | 26 | artist_list_path = user_args['artist_list_path'] 27 | data_folder = user_args['data_folder'] 28 | 29 | music_crawler = MusicCrawler(artist_list_path, data_folder) 30 | music_crawler.crawl_musics() 31 | 32 | 33 | if __name__ == '__main__': 34 | main() 35 | -------------------------------------------------------------------------------- /vagalume_downloader.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | 3 | from crawler.music_crawler import MusicDownloader 4 | 5 | 6 | def create_argparse(): 7 | parser = argparse.ArgumentParser() 8 | 9 | parser.add_argument('-kfp', 10 | '--key-file-path', 11 | type=str, 12 | help='Location of the file containing the Vagalume API key') 13 | 14 | parser.add_argument('-df', 15 | '--data-folder', 16 | type=str, 17 | help='Location of the data files') 18 | 19 | parser.add_argument('-cfn', 20 | '--code-files-name', 21 | type=str, 22 | help='Name of the file that contains the music ids') 23 | 24 | return parser 25 | 26 | 27 | def main(): 28 | parser = create_argparse() 29 | user_args = vars(parser.parse_args()) 30 | 31 | key_file_path = user_args['key_file_path'] 32 | data_folder = user_args['data_folder'] 33 | code_files_name = user_args['code_files_name'] 34 | 35 | music_downloader = MusicDownloader(key_file_path, data_folder, code_files_name) 36 | music_downloader.download_all_songs() 37 | 38 | 39 | if __name__ == '__main__': 40 | main() 41 | --------------------------------------------------------------------------------
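As a closing illustration, the two Vagalume entry points (vagalume_crawler.py and vagalume_downloader.py) could also be chained from a single Python script instead of the shell wrappers. This is only a sketch built on the interfaces shown above; the paths mirror the defaults in scripts/run_vagalume_crawler.sh and scripts/download_musics.sh, and a valid Vagalume API key is assumed to exist in data/config.

```python
# Sketch only: run the crawl and download steps programmatically.
from crawler.music_crawler import MusicCrawler, MusicDownloader

data_folder = 'data/'

# 1. Collect song codes and names for every artist listed in data/artist_list.txt.
crawler = MusicCrawler(data_folder + 'artist_list.txt', data_folder)
crawler.crawl_musics()

# 2. Download the lyrics referenced by each artist's song_codes.txt,
#    using the Vagalume API key stored in data/config.
downloader = MusicDownloader(data_folder + 'config', data_folder, 'song_codes.txt')
downloader.download_all_songs()
```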